Sei sulla pagina 1di 11

This article has been accepted for publication in a future issue of this journal, but has not been

fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/JSEN.2018.2867221, IEEE Sensors
Journal
1

Toward User-Independent Emotion Recognition


using Physiological Signals
Amani Albraikan1,3 , Diana P. Tobón2 , and Abdulmotaleb El Saddik1

1
Multimedia Computing Research Laboratory (MCRLab), University of Ottawa, ON, Canada
2 School of Electrical and Engineering Department, Universidad del Valle, Cali,
Colombia
3 Department of Computer Science, Princess Nourah bint Abdulrahman University, Saudi Arabia
Emails: aalbr012@uottawa.ca; tobon.diana@correounivalle.edu.co; elsaddik@uottawa.ca

Abstract—Many techniques have been developed to improve of emotion. Basic emotion theories posit that all feelings can
the flexibility and the fit of detection models beyond user- be derived from a limited set of universal and innate basic
dependent models, yet detection tasks continue to be complex emotions showing distinct patterns of psychophysiology that
and challenging. For emotion, which is known to be highly
user-dependent, improvements to the emotion learning algorithm are cross-cultural and user-defined. This phenomena is referred
can greatly boost predictive power. Our aim is to improve the to as stimulus-response specificity (SRS). This view argues
accuracy rate of the classifier using peripheral physiological that different emotions are associated with unique patterns
signals. Here, we present a hybrid sensor fusion approach based of physiological changes [3]. Conversely, appraisal theories
on a stacking model that allows for data from multiple sensors suggest a more flexible and dynamic mechanism that takes
and emotion models to be jointly embedded within a user-
independent model. WMD-DTW, which is a weighted multi- into account the interactions of stimuli, the needs and goals
dimensional DTW, and the k-nearest neighbors algorithm k-NN of the observer (top-down processes). Constructivist theories
are used to classify the emotions as a base model. The ensemble have expanded upon the appraisal theories, focusing on the
methods were used to learn a high-level classifier on top of the constraining bottom-up effects of mental representations and
two base models. We applied a meta-learning approach to the available concepts to the final emotion decision [4]. The latter
dataset and showed that the ensemble approach outperforms
any individual method. We also compared the results using two two theories take into account individual response specificity
datasets. Our proposed system achieved overall accuracies of (IRS).
65.6% for all users for the E4-dataset and 94.0% and 93.6% for The classification of emotion has mainly been studied based
recognizing valence and arousal emotions, respectively, using the on two fundamental viewpoints: SRS and IRS. For the first
MAHNOB dataset. view, SRS is an atomic discrete circuit. All components and re-
Index Terms—emotion recognition, physiological signals, multi- sponses must be aligned with each other focusing on emotional
dimensional dynamic time warping (MD-DTW), optimization, responses that are typically beyond an individuals control [3].
ensemble learning
The autonomic measurements include the facial expressions
(e.g., smiling) and the physiological responses (e.g., sweating)
I. I NTRODUCTION that are caused by changes in the central nervous system or
The understanding of human behavior and cognition has peripheral nervous system. However, emotions are a complex
begun to influence the fields of economics, finance, account- state of feeling for human beings, and the physiological signals
ing, law, and marketing, among others. When one experiences related to these emotions may vary from one person to another.
the cognitive awareness that emotions can affect decisions, Individual differences represent one of the main challenges in
better decisions will be made, thus leading to fewer negative emotion detection. Individual differences in emotional thresh-
consequences. Human decision-making often includes a com- olds lead to individual differences in patterns of behavior
plex stretch of emotions and rationalizations. According to the [5]. The main reasons behind individual differences are the
Gallup 2017 Global Emotions Report, 70% of human behavior dynamic nature of emotions and the perception difference
is based on emotions, not reasoning [1]. Cognitive psychology of a stimulus. The componential architecture depends on the
studies the mental processes that underlie behavior, includ- person, the situation, and the time [6].
ing thinking, deciding, and reasoning. These topics cover Alternatively, the IRS model was proposed by James A.
a wide range of research domains, including the workings Russell in 1980 [7]. In the dimensional approach, emotions
of memory, attention, perception, knowledge representation, can be described based on three dimensions: pleasure, arousal
reasoning, creativity, and problem-solving [2]. Statistics show and dominance (PAD). In 1974, Mehrabian and Russell [8]
the importance of developing the emotion recognition system, established a major scale designed to measure the combi-
which can improve quality of life and decision-making ability. nation of these dimensions. Self-reports focus on emotions
When comparing the roles of emotional stimuli and per- experienced in response to a given stimulus by describing the
ception, different influential psychological theories provide orthogonal continuums from pleasure to displeasure, activa-
clues about the mechanisms through which emotions serve tion to deactivation, and controlling to dominant. Self-report
to organize adaptive responses. The main approaches are measures are a necessary tool for subjective feelings, but any
basic theories, appraisal theories, and constructivist theories self-report methods are vulnerable to validity and reliability

1558-1748 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/JSEN.2018.2867221, IEEE Sensors
Journal
2

issues. Although most authors report their emotion scales to respiration (EDR) signals and through the use of principal
be sufficiently reliable, rating scales are one of the reliability component analysis (PCA) for dimensional reduction [24].
limitations. The most critical limitation involves response bias; However, when using the dimensional approach, it is not
for example, respondents may be unwilling to report their necessary to map dimensional spaces into a specific emotion,
emotions because of social desirability bias, especially for as this approach captures what emotions have in common but
sensitive topics (e.g., erotica and charity) [9]. Respondents not what is unique to a specific emotion. The dimensional
may also be unable to report their emotions because they are approach can capture a particular aspect of ones internal state
not aware of exactly how they feel. Furthermore, people may but not the whole of emotion. To overcome this problem, a
overestimate the negativity of their reactions when they are hyper theory was postulated combining both dimensions and
anticipating negative events [10]. The quality of the self-report emotion categories [16]. By using the hyper theory, most of the
will affect the accuracy of the results. The introspective ability current emotion recognition systems map the two dimensions
may lead to a consistent response to the experiment [11]. into two or three classes within arousal-valence areas. When
Both SRS and IRS can influence physiological response using two classification schemes, the classes are high and
patterns [12]. Thus, a physiological response can be seen as low for arousal and positive and negative for valence. When
a sum of IRS and SRS [13]. On one hand, there are large there are three classes, the classes are calm, medium and
differences among individuals in how these two fundamental activated for arousal and unpleasant, neutral and pleasant
dimensions of affect are related to a persons experience [14]. for valence [23], [30]. For example, fear and anger share
On the other hand, the quality of self-report may affect the unpleasant and aroused feelings but differ in the external
accuracy of the result [11]. Therefore, we proposed a method causes and behavioral reactions, yet the model may not capture
that uses both emotion models. To get the best of both worlds, the difference between them.
we proposed multimodal emotion detection using both discrete Most hyper-approaches use four classes to map emotions
and dimensional models. The first model used was the model into each different quadrant of the two dimensional plane [19],
by Ekman [15], which utilizes the physiological signals to map [20], [25], [28]. Other works have reported an accuracy of 70%
reactions to the emotion’s stimuli labeling. The second model for classifying three emotions (calm, positively excited, and
used was a hyper model by Russell [16] [17], which utilizes negatively excited) [18], 75% for classifying four emotions
self-report labeling for induced emotion annotation. (neutral, sadness, fear and pleasure) [25], 45% for classifying
The remainder of this paper is organized as follows: Section six emotions (amusement, contentment, disgust, fear, neutral,
II presents related work, and Section III describes the materials and sadness) [21], and 50% for classifying nine emotions
and methods used in this study, including physiological data (anger, interest, contempt, disgust, distress, fear, joy, shame,
collected during emotion induction, the experimental setup, and surprise) [22]. Clearly, the complexity is higher when
data analysis, and classification methods. Preliminary results the model handles more emotions, and thus, the accuracy is
are presented in Section IV. Finally, the study conclusions and reduced. Table I summarizes related works using physiological
future work are described in Section V. signals.
To understand individual differences in human emotions
II. R ELATED WORK and affective phenomena, we must understand and unravel the
Emotions can be recognized based on four categories: componential and dynamic nature of emotion and how the
audio-visual information, physiological signals, tactile per- componential architecture is dependent on time, the person,
ception, and multimodal form [31]. Recent work has been and the situation [6]. In [34], the authors highlight the influ-
performed to improve the identification of human emotion ence of brain activity, and the individual differences include
recognition in real-world scenarios: heterogeneous images personality, dispositional effect, biological sex, and genotype.
from social networks and active learning for facial emotion Recent studies have focused on how emotion components
were used in [32]; multi-directional regression and ridgelet and their constellations dynamically unfold over time [35],
transform for audio-visual information was proposed in [33]; [36]. In [37], the participants showed sizeable individual
and enhanced prediction of a users depression was achieved by differences in the duration of their emotions. In [38], the
collecting emotional information using a mobile questionnaire authors showed individual differences in the variability of
to provide a recommendation for users with depression-related emotional intensity, while in [39] the authors showed the
emotions. In this study, we used physiological signal-based degree of synchronicity that emotion components display over
emotion recognition because it is computationally less com- time. Time is the main element that explains the divergent
plex compared to other methods [31]. This method can provide dynamics of emotions across individuals. This result highlights
real-time monitoring of emotions at any time and any place the importance of segmentation in the implementation process,
providing the user with more mobility and freedom. in feature selection, and in the classification method that can
In the literature, most of the work explains accuracy using address the dynamic nature of emotion. In [25], the authors
pleasure and arousal [19], [29], [27], [26]. Since this method improved the accuracy by clustering the users into three groups
was first proposed, it has been widely employed in a large based on their IRS and SRS. The corresponding emotion
number of studies on emotion. The highest accuracy was recognizers were built according to these groups, and a new
approximately 96% for arousal and 94% for valence using very user would be generalized to the recognizer constructed by
high volumes of features extracted from heart rate variabil- the similar group response. In [22], the system trained the
ity (HRV), respiration (RSP) and electrocardiogram-derived data into separate subject models and then used KNN to

1558-1748 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/JSEN.2018.2867221, IEEE Sensors
Journal
3

Related Predicted emotions Physiological Stimuli No of Devices No of Features Classification Accuracy


work signal Used Subjects (%)

Three emotions calm, EEG, GSR, IAPS 5 64 electrodes, 384 KNN, LDA 50 %
[18] positively excited, RSP, ST, Galvanic Skin (KNN), 70
negatively excited BVP Resistance %(LDA)
Four emotions High EMG, Film 12 4 sensors includes total of 152 SVM, ANFIS 79.3 %
[19] Stress, Low Stress, ECG, RSP, clips wearable EMG, 3 ECG, 3 (SVM), 76.7
Disappointment and and EDA balaclava for Respiration, 7 % (ANFIS)
Euphoria EMG, and EDA
wireless
communication
Four emotions (Joy, ECG, Musical 3 ProComp Infiniti total of 110 53 PLDA, EMDC 65%
[20] Anger, Sad, Pleasure) EMG, RSP, induc- ECG, 10 EMG, (PLDA),
SC tion 37 RSP, 10 SC 70%
(EMDC)
six emotions BVP, EMG, IAPS 10 5 sensors total of 306 FD,SVM 0%
[21] (Amusement, SC, RSP BVP, 6 EMG, 6 (FD)45%
Contentment, SC6 SKT6 (SVM)
Disgust, Fear, RESP
Neutral, and sadness)
nine emotions (anger, BVP, ECG, IAPS 28 ProComp Infiniti total of 36 KNN 50.3%
[22] interest, contempt, EMG, SC, 6 features for (KNN)
disgust, distress, fear, RSP each signal
joy, shame, and
surprise)
three Arousal and GSR, RSP, video 25 Biosemi active II 20 GSR, 14 SVM 46.2% for
[23] valence levels EEG, ECG, clips system RSP, 64 ECG, Arousal,
and ST and 4 ST 45.5% for
valence
Neutral, Arousal and HR, RSP, IAPS 35 BIOPAC MP150 very high with QDC 96% for
[24] valence levels EDR the used of PCA Arousal,
94% for
valence
Four emotions ECG, GSR Film 11 BIOPAC MP150 total of 11867 KNN, DT, 68% (5-NN)
[25] Neutral, sadness, fear and PPG clips ECG21 GSR30 NB, MLP and 66.14%
and pleasure PPG RF (DT)29.13%
(NB)
64.37%
(MLP)75.69%
(RF)
Level of exciting ECG, EDA, IAPS 27 Biopac MP36 , total of 22540 LSTM 80% for
[26] video and HQ headset + LQ (video), 65 network Arousal,52%
audio microphones and (audio),54 for valence
HD 720p webcam (ECG),66
(EDA)
Neutral and Arousing EDA IAPS 40 Textile electrodes 4 KNN 71.67%
[27] levels
Four emotions (Joy, EQ Radeo, multi- 12 FMCW radio 27 SVM 72,3%
[28] Pleasur, Sadness, IBI stimuli
Anger)
Neutral and Arousal EDA, ECG, film 27 Biopac MP36 total of 11053 CCC Arosal
[29] and Valence levels HR, SCL, clips (ECG), 37 46.3%Va-
SCR (RSP), 10 (SC), lence
and 10 (EMG) 40,7%
three Arousal and GSR, RSP, video 24 Biosemi active II 20 GSR, 14 SVM 59.57% for
[30] Valence levels EEG, ECG, clips system RSP, 64 ECG, Arousal,
and ST and 4 ST 57.44% for
valence

TABLE I: Related works using physiological signals. The acronyms used in the table as following: For Physiological signal:
Electroencephalography (EEG), galvanic skin response (GSR), Eectrodermal Activity (EDA), Photoplethysmography (PPG), Blood volume
pulse (BVP), Inter Beat Interval (IBI), Heart Rate Variability (HRV), Electrocardiography (ECG), Electromyography (EMG), Skin
Conductance (SC), Skin Conductance level (SCL), Skin Conductance response (SCR), Skin Temperature (ST) and Respiration (RSP). For
the stimuli: The International Affective Picture System (IAPS) is a database of pictures designed to invoke emotions. For Classification
methods: K-Nearest Neighbors (KNN), Linear Discriminant Analysis (LDA), Support Vector Machine (SVM), Adaptive Neuro-Fuzzy
Inference System (ANFIS), Probabilistic Linear Discriminant Analysis (PLDA), Emotion-Specific Multilevel Dichotomous Classification
(EMDC), Fisher Discriminant (FD), Principal component analysis (PCA), Quadratic Discriminant Classifier (QDC), Decision Tree (DT),
Naive Bayes (NB), Multi-Layer Perceptron (MLP), Random Forest (RF), Recurrent Neural Network models for Sequence classification
(LSTM), and the Concordance Correlation Coefficient (CCC).

1558-1748 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/JSEN.2018.2867221, IEEE Sensors
Journal
4

classify the testing data samples into the corresponding subject since they had seen at least one film clip. We also eliminated
model according to the similarity of their inner structures. The two participants that had insufficient data. Hence, none of the
grouping showed an average improvement of 90%. However, remaining 18 participants reported having seen excerpts of any
to cluster the user in a real-time application, sufficient and of the clips. Our experiment received ethical approval from the
reliable data are needed to be able to assign a user to a certain Research Ethics Board of the University of Ottawa.
group. Therefore, in this paper, we use the DTW algorithm, b) Hardware: An Empatica E4 sensor was used to collect
which can address individual differences without the need the subjects’ physiological signals [49]. The motivation is that
for user clustering. DTW is a handy tool for matching two E4 is a multi-sensor device and is a wireless, flexible, and
time series in order to calculate an optimal path between the easy-to-use device that is worn on the wrist for real-time
two sequences [40]. It has been applied in many time series computerized biofeedback and data acquisition. Moreover, the
analyses, and the classification includes gesture analyses [41], multi-sensor data fusion represents a practical solution to
speech recognition [42], and video and graphical data [43]. infer high-quality information from noisy signals, data loss,
The complete description of the methods and materials is or inconsistency [50]. The E4 wristband has four embedded
described in the next section. sensors: a photoplethysmography (PPG) sensor, an electro-
For emotion recognition problems using physiological sig- dermal activity (EDA) sensor, a 3-axis accelerometer, and an
nals, several databases have been established to test the infrared temperature sensor. These sensors collect and report
pertinence of this modality, such as MAHNOB-HCI [23], information from the sympathetic branch of the autonomic
DEAP [44], MIT [45], and HUMAINE [46]. The signals for nervous system, including the galvanic skin response (GSR),
these databases come from electrocardiograms (ECGs), the blood volume pulse (BVP), acceleration, heart rate (HR), and
galvanic skin response (GSR), the skin temperature (Temp) skin temperature (ST).
and the respiration volume (RESP), all sampled at 256 Hz. In c) Emotion induction method: One of the critical steps
[47], the authors conducted a comparative study between the in identifying the emotion of a subject is to elicit emotional
DEAP and MAHNOB datasets. The study concluded that the responses. For the stimulus material herein, we selected the
stimulus videos in the MAHNOB dataset were more powerful emotional movie database (EMDB) because videos contain
to evoke the emotion than the videos clips used in DEAP. more emotional content than a single image. EMDB is a film
Moreover, the results obtained using the recorded signals in clip system for emotion elicitation [51]. The database consists
the MAHNOB dataset were better than those obtained using of 52 film clips without any sound effects based on emotional
DEAP. Among related studies, we compared our results with stimuli (i.e., valence, arousal, and dominance). These film
the results using the MAHNOB dataset [23]. The accuracy clips are divided into five categories: erotic, horror, socially
was 46.2% for arousal and 45.5% valence when classifying negative, socially positive, and neutral. However, using visual
the affective states into three classes. Another similar work stimulation videos without sound was not sufficient for the
was performed by applying early fusion for all features and adequate induction of emotion [52]. Thus, we added audio
an SVM classifier [30]. The accuracy was 59.57% and 57.44% to the same set of film clips in the EMDB for this study.
for arousal and valence, respectively. We decided to use movie clip number 1003 for the neutral
condition, movie clip number 2005 for the happy state, clip
III. M ETHODS AND MATERIALS number 3005 for the sadness condition, movie clip number
In this section, we describe the data acquisition and present 4000 for the erotic situation, and movie clip number 5000 for
the experiment for the emotion recognition task. We describe the horror condition. The source movie files were found on the
the participants, the used hardware, the employed emotion internet. Subsequently, we extracted the proper audio from the
induction method, and the experimental setup. Finally, we sources and then synchronized the audio with the clips in the
present a description of the methods used in the data analysis. database. Relaxing music with a smooth rhythm was played
for two minutes between each elicitation of emotion to ensure
emotion stabilization.
A. Data acquisition d) Experimental set-up: For each participant, the exper-
For our study, we compared our results using two datasets. iment involved a session lasting approximately 30 min in a
We collected our own dataset using the E4 sensor, which we quiet room. Figure 1 shows the experimental protocol. Before
refer to as the E4 dataset. The MAHNOB dataset was also the experiment, a questionnaire was completed by each subject
selected for this study using peripheral physiological signals. to record personal information related to the experiment. Then,
1) Descriptions of E4-dataset: the researchers explained the experimental procedure and any
a) Participants: A total of 24 healthy participants took risks that may occur. Before the trial was initiated, each
part in this experiment (12 males and 12 females). Their ages participant was required to wear the E4 wristband on his or
ranged from 18 to 40 years, with an average age of 25 years her non-dominant hand. Then, the participants began an active
(SD=2.3). All participants were informed about the content task (briskly cycling for 3 minutes on a laboratory bicycle),
and potential risks of the experiment. To assure that the emo- as recommended by [49]. An active task helps to build up
tional ratings given to the stimuli were due to psychological adequate moisture where the skin contacts the electrodes to
effects arising at the time of the experiment rather than due allow the electrodes in the EDA sensor to record sensitive
to the specific autobiographical memories associated with the changes [49]. This task serves the same purpose for the PPG
particular film clip used [48], we eliminated four participants sensor, which is necessary to obtain reliable HRV data. Next,

1558-1748 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/JSEN.2018.2867221, IEEE Sensors
Journal
5

there was a resting phase of 2 min, in which the participants similarity between two sequences, which may vary in size, in
listened to 2 min of relaxing music. Thereafter, the emotion order to calculate an optimal path between the two sequences
induction clips were played for the subjects in a sequence, [40]. The multi-dimensional DTW (MD-DTW) algorithm is
with relaxing music between each set to assure emotional an extension of the conventional DTW algorithm in which
stabilization. The design of the experiment was as follows; all dimensions are taken into account when identifying the
an initial maximal expiration task phase lasting approximately optimal matching path between two series. There are two ways
3 min, followed by a resting phase of 2 min, a sufficient visual of computing DTW in multi-dimensional time series: DT WD
stimulation period lasting 40 seconds and then a 2 min resting [43] and DT WI [41], where D denotes Dependent, and
phase and self-report. During the experiment, each participant I denotes Independent. Similar to single-dimensional DTW,
watched five film clips played in the same sequence to increase DT WD is calculated by the cumulative squared Euclidean
the intensity of perception, one for each category of emotion distances of d-dimensional data points Ti and Sj instead of
(i.e., neutral, cheer, sadness, erotic, and horror). After each the single data point used in the more familiar one-dimensional
stimuli clip, the participants were asked to report their feelings case, as shown in (1).
immediately using AniAvatar; an effective feedback tool for
k
affective states [53]. For data synchronization, Empatica E4 X
d(Ti , Sj ) = (Ti,d − Sj,d )2 . (1)
includes an event marking button that, when triggered, creates d=1
a timestamp in the session archive “tags.csv” file. This event
mark can then be used to align E4 data for the analysis stage. DT WI is the cumulative distance of all dimensions indepen-
For data acquisition and monitoring, we employed a mobile dently measured under DTW, as shown in (2). Other methods
application to collect and monitor the data during the exper- are similar to DT WI , such as adding up all the dimensions
iment; the application received the signal data via Bluetooth. and dealing with a single dimension time series [54].
The software uploaded the securely encrypted data to a cloud
k
service for data analysis. Herein, four channel signals, such X
DT WI (T, S) = DT W (Td , Sd ). (2)
as EDA, HR, inter-beat interval (IBI), and temperature, were
d=1
used.
2) Descriptions of MAHNOB-dataset: The MAHNOB DT WD has only one path to pick for all dimensions to mea-
dataset is a multimodal database recorded in response to sure their distances. Therefore, DT WD may not be a suitable
affective stimuli of emotion recognition. A multimodal setup distance measure for instances with lag. DT WI is capable of
was arranged for the synchronized recording of face videos, measuring the distances independently and is invariant to lag.
audio signals, eye gaze data, and peripheral physiological However, since DT WI is a combination of d paths, eventually
signals. A total of thirty participants from both genders and it produces a more significant distance score than DT WD [55].
different cultural backgrounds participated in the experiments. Similar to one-dimensional weighted dynamic time warping
During the emotion recognition experiment, 25 participants (1D-WDTW) [56] and weighted DT WD (W DT WD ) [57],
completed the experiment by watching 20 emotional videos we introduced a weighted multi-dimensional DTW (WMD-
and self reported the emotions they felt using arousal, valence, DTW) to reduce the overall distance score. The optimization
dominance, and predictability as well as emotional keywords method is used to minimize the warping windows of each path
[23]. and segmentation of the time series and to assign a weight for
each diminution to overcome the disadvantage of DT WI . The
wd is the weight that was assigned to each diminution d. We
B. Data Analysis
can write WMD-DTW as:
Multi-sensor data fusion is a well-established research area
k
in the emotion detection domain. There is wide literature X
W M D − DT W (T, S) = wd × |DT W (Td , Sd , wwd )|. (3)
addressing sensor fusion at different levels and using di- d=1
versified approaches; readers can follow the survey in [50].
Depending on the processing level, data fusion approaches where WMD-DTW is the cumulative distance of all dimen-
can be divided into three strategies: centralized, distributed sions of a two k-dimensions time series T and S independently
and hybrid. The centralized approach relies on a fusion center measured under DTW. DT W (Td , Sd ) is defined as the DTW
in which the processing is performed. The distributed approach distance of the dth dimension of T and the dth dimension
executes each sensor independently to form a global decision. of S. In equation (3), each dimension is considered to be
However, the hybrid data fusion approach benefits from both independent, and DTW is allowed the freedom to warp each
the centralized and distributed techniques. In this strategy, data dimension independently of the others with respect to the
collection and preprocessing are performed with a distributed warping window parameter wwd . The WMD-DTW classifier
strategy; then, a centralized strategy for fusing data performs calculates a distance matrix between two k-dimensions time
the decision-level computation [50]. In this study, we used a series T and S according to a warping window parameter wwd
hybrid data fusion approach. The following section highlights and dimension weight wd . WMD-DTW uses a 1-NN algorithm
a detailed explanation of the data analysis methods. to find the smallest distance in WMD-DTW and return the
1) Dynamic time warping (DTW): DTW is a handy tool corresponding label associated with that Sd sample within S.
for matching two time series. The algorithm measures the The optimization process is illustrated in the next section.

1558-1748 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/JSEN.2018.2867221, IEEE Sensors
Journal
6

Fig. 1: Experimental protocol for the data acquisition involving the resting phases, and the visual stimulation periods. RM corresponds to
relaxing music.

2) Tuning the hyper-parameters of an estimator: Recently, to propose a candidate point xi [61].


researchers in machine learning have treated the hyper-
U CB(x) = µGP (x) + κσGP (x) (4)
parameter optimization strategy as an important component of
all learning algorithms [58]. The most commonly employed EI(x) = E[f (x) − f (x+
t )] (5)
optimization algorithms are known as sequential model-based P I(x) = P (f (x) ≥ f (x+
t ) + κ) (6)
optimization (SMBO) algorithms, also known as Bayesian
We used upper confidence bound (UCB), negative expected
optimization. In SMBO approaches, there are three main
improvement (EI), and the negative probability of improve-
tuning methods: grid searches, random searches, and Bayesian
ment (PI) as acquisition functions to minimize over the Gaus-
optimization-based searches. The grid search is the most
sian prior. where µGP (x) is high (exploitation), and κσGP (x)
widely used method; however, grid searches retain the best
is high (exploration). κ is used in acquisition functions to
combination after all the possible combinations of parameter
control the trade-off between exploration-exploitation, while
values have been evaluated [59]. As samples are expensive to
x+t is the best point observed so far.
acquire, the increased precision comes at the cost of a higher
b) The optimization and Cross-Validation: By using op-
runtime. Other search methods have more favorable properties
timization, the function from hyper-parameter values evaluates
than an exhaustive search, such as Gaussian process (GP)-
a validation set. Cross-validation methods are used to evaluate
based optimization, which samples based on educated guesses
the machine learning model at each iteration of the GP. The
more effective in high dimensional spaces [60]. However, GP-
aim is to reveal the highest accuracy possible at the lowest
based algorithms are more complicated than grid searches as
number of iterations possible. We employed two-user-out
they have several free hyper-parameters themselves; Bayesian
cross-validation with F-score of the classification as our metric
optimization has been shown to obtain better results in fewer
to identify the best parameter optima in the WMD-DTW.
experiments than grid search and random search [59]. Thus,
At each iteration, one users data was used for optimization
GP-based optimization is used for precisely finding the regu-
validation, while the other users data was used for the meta-
larization parameter optima in the WDTW.
learning approach to prevent overfitting.
3) Ensemble learning: The ensemble method is a powerful
a) Gaussian Process Based Optimization: Gaussian op- technique for improving the modeling of the learning algo-
timization is a technique for the global optimization of noisy rithm and significantly boosting predictive power [63]. It is
black-box functions. GP-based optimization is a statistical a linear model in a very high dimensional space, where each
model in which a GP prior distribution is chosen to describe point in the space represents the model weights. It can improve
the unknown function under optimization using an acquisition the classification accuracy by canceling out independent errors
function [60]. The acquisition function is typically an inex- made by individual classifiers [64]. The most popular methods
pensive function that can be evaluated at a given point. A include Bayesian model averaging, bagging, boosting, and
GP surrogate model is maintained for the unknown function stacking. The Bayesian model uses a vote proportional to the
to determine the optimal next sampling location. The model likelihood [65], bagging uses an unweighted majority vote
is updated with every newly obtained sample value [61]. We [66], boosting uses a weighted majority vote, and stacking
used Scikit-learn [62], an open-source package for hyper- learns an ensemble classifier based on the output of multiple
parameter tuning methods. The Scikit-Optimize library pro- base classifiers [63]. Some studies [66], [67], [68], [65]
vides algorithms for performing hyper-parameter optimization. show that stacking has robust performance. In fact, stacking
For Gaussian priors, initial points were used to find local performs better than both boosting and bagging [66]. It also
minima. The optimal of these local minima is used to update outperforms the Bayesian model averaging scheme [65]. Thus,
the prior. The GP forms the posterior distribution over the we used stacking to learn a high-level classifier on top of the
objective function. Then, the posterior distribution is used base classifiers, called first-level classifiers.
to construct an acquisition function for querying the next a) Stacking classifier: Stacking is a successful technique
point. Each acquisition function is optimized independently to combine multiple models in the ensemble model to achieve

1558-1748 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/JSEN.2018.2867221, IEEE Sensors
Journal
7

IV. R ESULTS AND DISCUSSION


Training set For the analysis, we used two datasets: the E4 dataset that
we collected and the MAHNOB dataset. For the E4 dataset,
Figure 3 shows the confusion matrix1 for both models and
Training folds Validation fold
the stacking model. Table II shows a summary of the results
of the various models that were used for the E4 dataset.
The accuracy is calculated using the WMD-DTW and 1-
NN classifiers. For the MAHNOB dataset, Table III shows
training
the comparative summary result of the various models that
Repeat k times

were used for the MAHNOB dataset. Figures 4 and 5 show


C1 C2 . . . Cm Classifiers
the confusion matrix1 for the various base models with and
predictions
without padding, while Figure 6 shows the overall staking
results for both arousal and valence.
P1 P2 . . . Pm Level-1 predictions
in kth iteration
From this analysis, it is clear that the performance slightly
varies from one model to another. Table II shows the summary
results using the E4-dataset. It includes the optimized param-
All level-1 predictions
eters for each signal such as the max warping window, delay
time, frame size, and weight parameters. Each sensor makes
an independent report based on its weight as reported using
Meta-Classifier
- the optimization process. The EDA signal has a significant
influence in the first model compared to the second model.
However, the HR signal shows a higher weight in the second
model compared to the first model. In Figure 3, the basic
Pf Final prediction
model works by finding a discriminate pattern that can distin-
guish between five different emotion classes including neutral,
cheer, sadness, erotic, and horror. This model performs the
best for the horror and neutral emotions but lacks accuracy
Fig. 2: The Stacking Cross-Validation Procedure [64]. for the sad emotion. The main disadvantage of this model
is user perception deference. To overcome this issue, the
diminution model uses self-report [7]. This model shows
better performance. It is a meta-learning approach in which the adequate performance for the sad and cheer emotions and
base classifiers are called first-level classifiers, and a second- deficient achievement for the erotic emotion. Since self-report
level classifier is learned to combine the first-level classifiers can only measure the perception of emotional reaction after the
[64], as shown in Figure 2. At the first-level classifiers, we end of a stimulus, the emotional measurement may be affected
apply a bootstrap sampling technique to learn independent due to many reasons. During the experiment, some participants
classifiers, adaptively learn base classifiers based on data were confused about whether they felt happy or neutral.
with a weight distribution, tune parameters in a learning Another limitation is that participants may be unwilling to
algorithm to generate diverse base classifiers (homogeneous report their emotion because of shame or concerns about social
classifiers), and apply different classification methods and/or desirability. This was very obvious in the erotic emotion. There
sampling methods to generate base classifiers (heterogeneous was an overall accuracy of 63.3% for the first base model
classifiers). At the next stage, the second-level classifiers, we and an accuracy of 61.1% in the second base model. The
construct a new data set based on the output of base classifiers. WMD-DTW of the objective model presented a slightly better
Here, we used class probabilities from the first-level classifier performance than the subjective model. The stacking WMD-
as new features in the second-level classifier to train the meta- DTW outperformed both models, as the average accuracy was
classifier for successful stacking [66]. The advantage of using 65.6% for all users.
conditional probabilities as features is that the training set of Visual inspection of the time series recorded during the
the second-level classifier will include not only predictions but experiment shows that the EDA performance differs from one
also prediction confidence of first-level classifiers. We used an user to another. We identify three groups of subjects with
open-source python library, StackingCVClassifier [64]. similar emotion-relevant physiological response patterns. The
first group, the standard group, shows a strong relationship
b) Stacking and Cross-Validation: To avoid overfitting, between the arousal level and the phasic driver peak amplitude
a stacking cross-validation method within first-level classifiers
1 The detailed confusion matrix results include the overall accuracy, the
is used [68]. Similar to the base classifiers, we used one-user-
precision, and the recall. The column on the far right of the plot shows the
out cross-validation to evaluate classification performance. positive predictive and the false discovery rate. The row at the bottom of
The user data used in this validation are not used in the the plot shows true positive and false negative rates. The cell in the bottom
model generated in the first-level classification. The first-level right of the plot shows the overall accuracy. The diagonal cells correspond
to observations that are correctly classified. The off-diagonal cells correspond
classifiers are fit to the different data set that is used to prepare to incorrectly classified observations. In each cell, both the number and the
the inputs for the second-level classifier. percentage of observations are displayed

1558-1748 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/JSEN.2018.2867221, IEEE Sensors
Journal
8

(a) (b) (c)


10 2 1 1 0 71.4% 6 2 1 1 1 54.5% 11 2 1 1 0 73.3%
1 1 1
11.1% 2.2% 1.1% 1.1% 0.0% 28.6% 6.7% 2.2% 1.1% 1.1% 1.1% 45.5% 12.2% 2.2% 1.1% 1.1% 0.0% 26.7%

3 12 4 1 1 57.1% 1 25 3 3 3 71.4% 2 13 4 1 1 61.9%


2 2 2
3.3% 13.3% 4.4% 1.1% 1.1% 42.9% 1.1% 27.8% 3.3% 3.3% 3.3% 28.6% 2.2% 14.4% 4.4% 1.1% 1.1% 38.1%
Output Class

Output Class

Output Class
3 3 9 2 1 50.0% 0 3 9 1 1 64.3% 3 2 9 2 1 52.9%
3 3 3
3.3% 3.3% 10.0% 2.2% 1.1% 50.0% 0.0% 3.3% 10.0% 1.1% 1.1% 35.7% 3.3% 2.2% 10.0% 2.2% 1.1% 47.1%

2 0 3 13 3 61.9% 1 3 2 4 1 36.4% 2 0 3 13 3 61.9%


4 4 4
2.2% 0.0% 3.3% 14.4% 3.3% 38.1% 1.1% 3.3% 2.2% 4.4% 1.1% 63.6% 2.2% 0.0% 3.3% 14.4% 3.3% 38.1%

0 1 1 1 13 81.2% 1 6 1 0 11 57.9% 0 1 1 1 13 81.2%


5 5 5
0.0% 1.1% 1.1% 1.1% 14.4% 18.8% 1.1% 6.7% 1.1% 0.0% 12.2% 42.1% 0.0% 1.1% 1.1% 1.1% 14.4% 18.8%

55.6% 66.7% 50.0% 72.2% 72.2% 63.3% 66.7% 64.1% 56.2% 44.4% 64.7% 61.1% 61.1% 72.2% 50.0% 72.2% 72.2% 65.6%
44.4% 33.3% 50.0% 27.8% 27.8% 36.7% 33.3% 35.9% 43.8% 55.6% 35.3% 38.9% 38.9% 27.8% 50.0% 27.8% 27.8% 34.4%

1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
Target Class Target Class Target Class

Fig. 3: The confusion matrix for the E4-dataset, (a) using objective model [15], (b) subjective model [17], and (c) Stacking result. The
classes are: 1- neutral, 2- cheer, 3- sadness, 4- erotic, and 5- horror 1 .

Parameters % Accuracy
DTW Algorithms Signal Max Warping Delay Frame Size Weight F1-score Precision Recall
EDA 29 34 44 1.00
MW-DTW
HR 89 4 44 0.53 63.44 64.35 63.33
using objective model [15]
Temp 39 4 44 0.24
EDA 34 68 10 0.55
MW-DTW
HR 5 68 10 1.00 62.46 63.12 62.22
using subjective model [17]
Temp 57 48 30 0.33

TABLE II: A summary comparative results of the various models using E4-dataset.

of the EDA signal, as we expect [69]. A small portion of the used post-sequence padding that was applied to the end of
general population shows changes in the phasic peak amplitude each sequence. We presented the emotional states in three
of the EDA signal. These minimal changes can be due to many defined classes. The classes were calm, medium and activated
factors, such as medications that suppress the sympathetic for arousal and unpleasant, neutral and pleasant for valence.
nervous system response. Another factor is user personality; The classes were defined using nine emotional keywords
one subject from the second group indicated that she does (happiness, amusement, neutral, anger, fear, surprise, anxiety,
not show emotion and typically refers to herself as a cold disgust and sadness) as in [23] and [30].
person, and another subject also mentioned that he hardly In Table III, we note that the WMD-DTW of the objective
gets emotional. The third group, however, shows significant model presented a significantly better performance than the
changes in the driver peak amplitude of the EDA. subjective model. We achieved 71.00% for arousal and 65.50%
The second experimental result was obtained using the for valence for the first base model. We also achieved 93.76%
MAHNOB dataset. Each trial contained 30 seconds before the for arousal and 93.79% for valence for the second base model,
beginning of the effective stimuli and another 30 seconds after which we consider very promising compared to related studies.
the end. We used the first 30 seconds for the normalization Our proposed system achieved overall accuracies of 94.0%
step; the remaining parts were used in the classification and 93.6% for recognizing valence and arousal emotions,
models. Since the 20 emotional videos were varied in length, respectively. In the comparative study, these values outperform
a percentage delay and a frame size where selected based the two related works: 46,2% for arousal and 45.5% for
on the following equation: Fs + Dt = Sl + α, where (Fs ) valence [23] and 59.57% for arousal and 57.44% for valence
is the frame size, (Dt ) is the delay time, (Sl ) is the stimuli [30].
length, and α is a percentage of the stimuli length. Moreover, The MAHNOB dataset shows higher improvement for two
due to the variation of the length of the emotional videos, reasons. First, the obtained data uses different methods, such
the segmentation resulted in unequal frame sizes. Therefore, as the user numbers, the hardware sensors, and the emotion
we attempted two different techniques to calculate the WMD- induction methods. Second, the emotion recognition systems
DTW. The first technique used the exact length without any for the MAHNOB dataset map the two dimensions into three
padding. The second technique used the padding technique classes within arousal-valence areas. The classes are calm,
to achieve the same frame size for the entire dataset. In the medium and activated for arousal and unpleasant, neutral and
time series analysis, for different length input sequences, it pleasant for valence. It is not necessary to map dimensional
is a common practice to use zero padding such that each spaces into a specific emotion. For example, fear and anger
sequence has the same length [70]. Padding can be applied share unpleasant and activated aroused feelings but differ in the
to either the beginning or at the end of the sequences. We external causes and behavioral reactions, yet the model cannot

1558-1748 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/JSEN.2018.2867221, IEEE Sensors
Journal
9

(a) (b) (c) (d)


153 36 24 71.8% 186 16 0 92.1% 148 46 33 65.2% 181 13 0 93.3%
1 1 1 1
Output Class 30.6% 7.2% 4.8% 28.2% 37.2% 3.2% 0.0% 7.9% 29.6% 9.2% 6.6% 34.8% 36.2% 2.6% 0.0% 6.7%

Output Class

Output Class

Output Class
39 134 7 74.4% 14 183 0 92.9% 44 121 9 69.5% 19 185 0 90.7%
2 2 2 2
7.8% 26.8% 1.4% 25.6% 2.8% 36.6% 0.0% 7.1% 8.8% 24.2% 1.8% 30.5% 3.8% 37.0% 0.0% 9.3%

23 8 76 71.0% 0 1 100 99.0% 23 10 66 66.7% 0 2 100 98.0%


3 3 3 3
4.6% 1.6% 15.2% 29.0% 0.0% 0.2% 20.0% 1.0% 4.6% 2.0% 13.2% 33.3% 0.0% 0.4% 20.0% 2.0%

71.2% 75.3% 71.0% 72.6% 93.0% 91.5% 100% 93.8% 68.8% 68.4% 61.1% 67.0% 90.5% 92.5% 100% 93.2%
28.8% 24.7% 29.0% 27.4% 7.0% 8.5% 0.0% 6.2% 31.2% 31.6% 38.9% 33.0% 9.5% 7.5% 0.0% 6.8%

1 2 3 1 2 3 1 2 3 1 2 3
Target Class Target Class Target Class Target Class

Fig. 4: The confusion matrix for arousal using MAHNOB-dataset (a) MW-DTW using 9 felt emotion Keywords without padding, (b)
MW-DTW using stimuli labeling without padding, (c) MW-DTW using 9 felt emotion Keywords with padding, and (d) MW-DTW using
stimuli labeling with padding. The classes are 1- calm, 2- medium and 3- activated 1 .

(a) (b) (c) (d)


55 31 18 52.9% 140 0 9 94.0% 52 34 20 49.1% 138 2 17 87.9%
1 1 1 1
11.0% 6.2% 3.6% 47.1% 28.0% 0.0% 1.8% 6.0% 10.4% 6.8% 4.0% 50.9% 27.6% 0.4% 3.4% 12.1%
Output Class

Output Class

Output Class

Output Class
31 154 28 72.3% 0 147 8 94.8% 25 143 37 69.8% 0 143 5 96.6%
2 2 2 2
6.2% 30.8% 5.6% 27.7% 0.0% 29.4% 1.6% 5.2% 5.0% 28.6% 7.4% 30.2% 0.0% 28.6% 1.0% 3.4%

18 33 132 72.1% 10 3 183 93.4% 28 41 120 63.5% 12 5 178 91.3%


3 3 3 3
3.6% 6.6% 26.4% 27.9% 2.0% 0.6% 36.6% 6.6% 5.6% 8.2% 24.0% 36.5% 2.4% 1.0% 35.6% 8.7%

52.9% 70.6% 74.2% 68.2% 93.3% 98.0% 91.5% 94.0% 49.5% 65.6% 67.8% 63.0% 92.0% 95.3% 89.0% 91.8%
47.1% 29.4% 25.8% 31.8% 6.7% 2.0% 8.5% 6.0% 50.5% 34.4% 32.2% 37.0% 8.0% 4.7% 11.0% 8.2%

1 2 3 1 2 3 1 2 3 1 2 3
Target Class Target Class Target Class Target Class

Fig. 5: The confusion matrix for valence using MAHNOB-dataset (a) MW-DTW using 9 felt emotion Keywords without padding, (b)
MW-DTW using stimuli labeling without padding, (c) MW-DTW using 9 different felt emotion Keywords with padding, and (d)
MW-DTW using stimuli labeling with padding. The classes are 1- unpleasant, 2- neutral and 3- pleasant 1 .

Accuracy (%)
F1-score Precision Recall F1-score Precision Recall
DTW Algorithms with padding without padding
MW-DTW Arousal 67.48 70.35 67.06 71.00 71.00 71.09
(using 9 emotion Keywords) Valence 63.15 68.55 63.18 65.50 65.62 65.60
MW-DTW Arousal 93.04 93.46 93.06 93.76 93.21 93.20
(using stimuli labeling) Valence 90.84 91.82 90.74 93.79 93.80 93.80
Arousal stacking result 94.00 95.04 93.05
Valence stacking result 93.69 93.78 93.68

TABLE III: A summary comparative results of the various models using MAHNOB-dataset.

(a) (b) capture the difference between them. However, the E4 dataset
1
186 16 0 92.1%
1
140 0 9 94.0% maps the dimensional spaces into a specific emotion despite
37.2% 3.2% 0.0% 7.9% 28.0% 0.0% 1.8% 6.0%
the fact that they share the same characteristics. For example,
Output Class

Output Class

2
14 183 1 92.4%
2
0 147 8 94.8% the neutral and sad emotions share the same calm arousal,
2.8% 36.6% 0.2% 7.6% 0.0% 29.4% 1.6% 5.2%
and sadness and fear share the same unpleasant valence.
3
0 1 99 99.0%
3
10 3 183 93.4% The complexity is higher for the E4 dataset than the second
0.0% 0.2% 19.8% 1.0% 2.0% 0.6% 36.6% 6.6%
dataset resulting in the accuracy reduction. Overall, it can be
93.0% 91.5% 99.0% 93.6% 93.3% 98.0% 91.5% 94.0% seen that the discriminate among physiological patterns was
7.0% 8.5% 1.0% 6.4% 6.7% 2.0% 8.5% 6.0% improved using the ensembling process. This superiority is
1 2 3 1 2 3 attributed to three reasons. First, the optimization methods
Target Class Target Class help select the best parameters for the signals to have the
most significant information possible. The selection of the
Fig. 6: The Stacking confusion matrix for MAHNOB-dataset (a) delay component reduced the degree of synchronicity that
Valence and (b) Arousal 1 . the emotional components display over time and the frame
size address the individual differences in the duration of the
participants emotions. The second reason is that the WMD-

1558-1748 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/JSEN.2018.2867221, IEEE Sensors
Journal
10

DTW learned the dynamic nature of the emotion and the and 93.6% for the MAHNOB dataset for recognizing valence
individual differences in the variability of emotional intensity and arousal emotions, respectively. The ensemble approach
over time without the need for group-based models. The markedly outperformed any of the base models.
hyper-parameter tuning proved to be the vital step and can In future work, we plan to integrate more features, includ-
had a large impact on the end performance of the machine ing non-invasive and continuous blood pressure measurement
learning models. The hyper-parameter tuning proved to have approaches based on the PPG signal, as in [71], and we
a large impact on the end performance of the Ekman models. will study the impact on the accuracy. Moreover, as mobile
Notably, the optimizer gives more weight to EDA for the first devices become increasingly more powerful with the use of
base model. However, the second base model gives greater cloud computing, we also plan to implement the proposed
weight to HR. The third reason for the superiority of the emotional analysis model in a mobile device, such as an
achieved accuracies is the use of the ensemble approach, which Android mobile phone. The system could be combined with a
allows for multiple emotion models and learning algorithms combination of emotions and multimedia to take advantage
to be jointly embedded within a meta-learning model. This of our emotion recognition system in other technological
result is because the staking result captures the individual fields, such as computer games, special education, and social
differences in both emotional perception and the component networks.
architecture. As seen in the experiment, the stimulus inter-
pretation is heavily dependent on the user, which appeared R EFERENCES
obvious when one of the users felt excited and overjoyed [1] Gallup and J. Clifton, “Gallup 2017 global emotions report.” Gallup
when seeing a horror clip. For the MAHNOB dataset, the World Poll, 2017.
result in Figures 4 and 5 suggest that most of the individual [2] F. Gino and G. Pisano, “Toward a theory of behavioral operations,”
Manufacturing & Service Operations Management, vol. 10, no. 4,
differences were based on the dynamic nature of the emotion. pp. 676–691, 2008.
As a result, the objective model (stimuli labeling model) [3] P. Ekman, R. W. Levenson, and W. V. Friesen, “Autonomic nervous
shows a significant improvement over the subjective model. system activity distinguishes among emotions,” American Association
for the Advancement of Science, 1983.
The result proves that the stimulus videos in the MAHNOB [4] L. F. Barrett, “Are emotions natural kinds?,” Perspectives on psycholog-
dataset were more influential to evoke the emotion. This result ical science, vol. 1, no. 1, pp. 28–58, 2006.
also suggests that the self-report may have some bias; as a [5] C. E. Izard, “Organizational and motivational functions of discrete
emotions.,” 1993.
result of the subjective model (9 emotion keywords model), [6] P. Kuppens, J. Stouten, and B. Mesquita, “Individual differences in
there is some insignificant improvement over the objective emotion components and dynamics: Introduction to the special issue,”
model. The average agreement using the multilayer Cohens Cognition and Emotion, vol. 23, no. 7, pp. 1249–1258, 2009.
[7] J. A. Russell and G. Pratt, “A description of the affective quality at-
kappa test was k = 0.32 for arousal and valence ratings. The tributed to environments.,” Journal of personality and social psychology,
correlation values were m = 0.40(SD = 0.26) for arousal vol. 38, no. 2, p. 311, 1980.
and m = 0.71(SD = 0.12) for valence [23]. In Figure 6, the [8] J. A. Russell and A. Mehrabian, “Distinguishing anger and anxiety in
terms of emotional response factors.,” Journal of consulting and clinical
objective model and subjective model within the ensembling psychology, vol. 42, no. 1, p. 79, 1974.
process outperformed any individual model and performed at [9] M. F. King and G. C. Bruner, “Social desirability bias: A neglected
least as good as the best of the base models. This outcome aspect of validity testing,” Psychology and Marketing, vol. 17, no. 2,
pp. 79–103, 2000.
will most likely depend on the size and quality of the dataset [10] P. Totterdell, B. Parkinson, R. B. Briner, and S. Reynolds, “Forecasting
used. feelings: The accuracy and effects of self-predictions of mood,” Journal
of Social Behavior and Personality, vol. 12, no. 3, p. 631, 1997.
V. C ONCLUSION AND F UTURE WORK [11] K. Poels and S. Dewitte, “How to capture the heart? reviewing 20
years of emotion measurement in advertising,” Journal of Advertising
The implementation of a physiological, signal-based emo- Research, vol. 46, no. 1, pp. 18–37, 2006.
tion recognition system involves several stages: physiologi- [12] B. T. Engel, “Stimulus-response and individual-response specificity,”
AMA Archives of General Psychiatry, vol. 2, no. 3, pp. 305–313, 1960.
cal signal data acquisition, preprocessing, feature extraction, [13] G. Stemmler and J. Wacker, “Personality, emotion, and individual
feature selection, and classification. Each step has been dis- differences in physiological responses,” Biological psychology, vol. 84,
cussed and summarized in this paper. It was noted that the no. 3, pp. 541–551, 2010.
[14] P. Kuppens, F. Tuerlinckx, M. Yik, P. Koval, J. Coosemans, K. J.
performance of each stage of the emotion recognition sys- Zeng, and J. A. Russell, “The relation between valence and arousal
tem is interdependent. Therefore, optimization techniques are in subjective experience varies with personality and culture,” Journal of
employed to select the most suitable parameters for reducing personality, vol. 85, no. 4, pp. 530–542, 2017.
[15] P. Ekman, “Are there basic emotions?,” Psychological review, no. 99,
the effect on recognition accuracy. In this paper, we collected pp. 550–3, 1992.
sufficient physiological data from 18 subjects in five induced [16] J. A. Russell, “Core affect and the psychological construction of
emotions (neutral, cheer, sad, erotic and horror) to investigate emotion.,” Psychological review, vol. 110, no. 1, p. 145, 2003.
[17] M. Yik, J. A. Russell, and J. H. Steiger, “A 12-point circumplex structure
the performance of physiological responses toward emotion of core affect.,” Emotion, vol. 11, no. 4, p. 705, 2011.
recognition in a user-independent scenario. We conducted a [18] Z. Khalili and M. H. Moradi, “Emotion detection using brain and
reasonable experiment for emotion elicitation and data col- peripheral signals,” Biomedical Engineering Conference CIBEC, 2008.
[19] C. Katsis, N. Katertsidis, G. Ganiatsas, and D. Fotiadis, “Toward emo-
lection. In our design, we obtained hundreds of records of tion recognition in car-racing drivers: a biosignal processing approach,”
physiological signals for each emotion, and we improved the Systems, Man and Cybernetics, Part A: Systems and Humans, IEEE
emotion elicitation by using both discrete and dimensional Transactions on, vol. 38, pp. 502–512, 2008.
[20] J. Kim and E. Andre, “Emotion recognition based on physiological
emotion models. Our proposed system achieved overall ac- changes in music listening,” IEEE Transactions on Pattern Analysis and
curacies of 65.6% for all users for the E4 dataset and 94.0% Machine Intelligence, vol. 30, p. 12, 2008.

1558-1748 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/JSEN.2018.2867221, IEEE Sensors
Journal
11

[21] C. Maaoui and A. Pruski, “Emotion recognition through physiological [46] E. Douglas-Cowie, R. Cowie, I. Sneddon, C. Cox, O. Lowry,
signals for human-machine communication,” in Cutting Edge Robotics, M. Mcrorie, J.-C. Martin, L. Devillers, S. Abrilian, A. Batliner, et al.,
InTech, 2010. “The humaine database: addressing the collection and annotation of
[22] Y. Gu, S.-L. Tan, K.-J. Wong, M.-H. R. Ho, and L. Qu, “A biometric naturalistic and induced emotional data,” Affective computing and in-
signature based system for improved emotion recognition using phys- telligent interaction, 2007.
iological responses from multiple subjects,” in Industrial Informatics [47] C. Godin, F. Prost-Boucle, A. Campagne, S. Charbonnier, S. Bonnet,
(INDIN), 8th IEEE International Conference, pp. 61–66, IEEE, 2010. and A. Vidal, “Selection of the most relevant physiological features for
[23] M. Soleymani, J. Lichtenauer, T. Pun, and M. Pantic, “A multimodal classifying emotion,” Emotion, vol. 40, p. 20, 2015.
database for affect recognition and implicit tagging,” IEEE Transactions [48] K. S. LaBar and R. Cabeza, “Cognitive neuroscience of emotional
on Affective Computing, vol. 3, no. 1, pp. 42–55, 2012. memory,” Nature Reviews Neuroscience, vol. 7, no. 1, pp. 54–64, 2006.
[24] G. Valenza, A. Lanatá, and E. P. Scilingo, “Improving emotion recog- [49] M. Garbarino, M. Lai, D. Bender, R. W. Picard, and S. Tognetti,
nition systems by embedding cardiorespiratory coupling,” Physiological “Empatica E3-A wearable wireless multi-sensor device for real-time
measurement, vol. 34, no. 4, p. 449, 2013. computerized biofeedback and data acquisition,” in Wireless Mobile
[25] C. Li, C. Xu, and Z. Feng, “Analysis of physiological for emotion Communication and Healthcare (Mobihealth), EAI 4th International
recognition with the irs model,” Neurocomputing, vol. 178, 2016. Conference on, IEEE, 2014.
[26] F. Ringeval, F. Eyben, E. Kroupi, A. Yuce, J.-P. Thiran, T. Ebrahimi, [50] R. Gravina, P. Alinia, H. Ghasemzadeh, and G. Fortino, “Multi-sensor
D. Lalanne, and B. Schuller, “Prediction of asynchronous dimensional fusion in body sensor networks: State-of-the-art and research chal-
emotion ratings from audiovisual and physiological data,” Pattern lenges,” Information Fusion, vol. 35, pp. 68–80, 2017.
Recognition Letters, vol. 66, pp. 22–30, 2015. [51] S. Carvalho, J. Leite, S. Galdo-Álvarez, and O. F. Gonçalves, “The emo-
tional movie database (EMDB): A self-report and psychophysiological
[27] A. Greco, A. Lanata, L. Citi, N. Vanello, G. Valenza, and E. P. Scilingo,
study,” Applied psychophysiology and biofeedback, 2012.
“Skin admittance measurement for emotion recognition: A study over
[52] K. H. Kim, S. W. Bang, and S. R. Kim, “Emotion recognition system
frequency sweep,” Electronics, vol. 5, no. 3, p. 46, 2016.
using short-term monitoring of physiological signals,” Medical and
[28] M. Zhao, F. Adib, and D. Katabi, “Emotion recognition using wireless biological engineering and computing, vol. 42, no. 3, pp. 419–427, 2004.
signals,” in Proceedings of the 22nd Annual International Conference [53] A. Sonderegger, K. Heyden, A. Chavaillaz, and J. Sauer, “Anisam &
on Mobile Computing and Networking, pp. 95–108, ACM, 2016. aniavatar: Animated visualizations of affective states,” in Proceedings
[29] G. Keren, T. Kirschstein, E. Marchi, F. Ringeval, and B. Schuller, “End- of the CHI Conference on Human Factors in Computing Systems, 2016.
to-end learning for dimensional emotion recognition from physiological [54] D. McGlynn and M. G. Madden, “An ensemble dynamic time warping
signals,” in IEEE International Conference on Multimedia and Expo classifier with application to activity recognition,” in Research and
(ICME), 2017. development in intelligent systems xxvii, pp. 339–352, Springer, 2011.
[30] M. B. H. Wiem and Z. Lachiri, “Emotion classification in arousal [55] M. Shokoohi-Yekta, B. Hu, H. Jin, J. Wang, and E. Keogh, “Generalizing
valence model using mahnob-hci database,” Int. J. Adv. Comput. Sci. dtw to the multi-dimensional case requires an adaptive approach,” Data
Appl. IJACSA, vol. 8, no. 3, 2017. Mining and Knowledge Discovery, 2017.
[31] M. Chen, P. Zhou, and G. Fortino, “Emotion communication system.,” [56] Y.-S. Jeong, M. K. Jeong, and O. A. Omitaomu, “Weighted dynamic
IEEE Access, vol. 5, pp. 326–337, 2017. time warping for time series classification,” Pattern Recognition, 2011.
[32] G. Muhammad and M. F. Alhamid, “User emotion recognition from a [57] T. Arici, S. Celebi, A. S. Aydin, and T. T. Temiz, “Robust gesture
larger pool of social network data using active learning,” Multimedia recognition using feature pre-processing and weighted dynamic time
Tools and Applications, vol. 76, no. 8, pp. 10881–10892, 2017. warping,” Multimedia Tools and Applications, 2014.
[33] M. S. Hossain and G. Muhammad, “Audio-visual emotion recognition [58] J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl, “Algorithms for
using multi-directional regression and ridgelet transform,” Journal on hyper-parameter optimization,” in Proceedings of the 24th International
Multimodal User Interfaces, vol. 10, no. 4, pp. 325–333, 2016. Conference on Neural Information Processing Systems, NIPS’11, (USA),
[34] S. Hamann and T. Canli, “Individual differences in emotion processing,” pp. 2546–2554, Curran Associates Inc., 2011.
Current opinion in neurobiology, vol. 14, no. 2, pp. 233–238, 2004. [59] S. M. LaValle, M. S. Branicky, and S. R. Lindemann, “On the rela-
[35] M. D. Lewis, “Bridging emotion theory and neurobiology through tionship between classical grid search and probabilistic roadmaps,” The
dynamic systems modeling,” Behavioral and brain sciences, 2005. International Journal of Robotics Research, 2004.
[36] D. Sander, D. Grandjean, and K. R. Scherer, “A systems approach to [60] J. Snoek, H. Larochelle, and R. P. Adams, “Practical bayesian optimiza-
appraisal mechanisms in emotion,” Neural networks, 2005. tion of machine learning algorithms,” in Advances in neural information
[37] P. Verduyn, E. Delvaux, H. Van Coillie, F. Tuerlinckx, and I. Van Meche- processing systems, pp. 2951–2959, 2012.
len, “Predicting the duration of emotional experience: two experience [61] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas,
sampling studies.,” Emotion, vol. 9, no. 1, p. 83, 2009. “Taking the human out of the loop: A review of bayesian optimization,”
[38] P. Kuppens, I. Van Mechelen, J. B. Nezlek, D. Dossche, and T. Tim- Proceedings of the IEEE, vol. 104, pp. 148–175, Jan 2016.
mermans, “Individual differences in core affect variability and their [62] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion,
relationship to personality and psychological adjustment.,” Emotion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al.,
vol. 7, no. 2, p. 262, 2007. “Scikit-learn: Machine learning in python,” Journal of Machine Learning
Research, vol. 12, no. Oct, pp. 2825–2830, 2011.
[39] I. B. Mauss, R. W. Levenson, L. McCarter, F. H. Wilhelm, and J. J.
[63] D. H. Wolpert, “Stacked generalization,” Neural networks, vol. 5, no. 2,
Gross, “The tie that binds? coherence among emotion experience,
pp. 241–259, 1992.
behavior, and physiology.,” Emotion, vol. 5, no. 2, p. 175, 2005.
[64] C. C. Aggarwal, Data classification: algorithms and applications. CRC
[40] D. Sankoff and J. B. Kruskal, “Time warps, string edits, and macro- Press, 2014.
molecules: the theory and practice of sequence comparison,” Addison- [65] B. Clarke, “Comparing bayes model averaging and stacking when model
Wesley Publication, 1983. approximation error cannot be ignored,” Journal of Machine Learning
[41] G. A. ten Holt, M. J. Reinders, and E. A. Hendriks, “Multi-dimensional Research, vol. 4, no. Oct, pp. 683–712, 2003.
dynamic time warping for gesture recognition,” in Thirteenth annual [66] K. M. Ting and I. H. Witten, “Issues in stacked generalization,” J. Artif.
conference of the Advanced School for Computing and Imaging, 2007. Intell. Res.(JAIR), vol. 10, pp. 271–289, 1999.
[42] J. Wang, A. Samal, J. R. Green, and F. Rudzicz, “Sentence recognition [67] G. Seni and J. F. Elder, “Ensemble methods in data mining: improving
from articulatory movements for silent speech interfaces,” in Acoustics, accuracy through combining predictions,” Synthesis Lectures on Data
Speech and Signal Processing (ICASSP), IEEE, 2012. Mining and Knowledge Discovery, vol. 2, no. 1, pp. 1–126, 2010.
[43] F. Petitjean, J. Inglada, and P. Gançarski, “Satellite image time series [68] R. Kohavi et al., “A study of cross-validation and bootstrap for accuracy
analysis under time warping,” IEEE transactions on geoscience and estimation and model selection,” in Ijcai, Stanford, CA, 1995.
remote sensing, vol. 50, no. 8, pp. 3081–3095, 2012. [69] W. Boucsein, Electrodermal activity. Springer Science & Business
[44] S. Koelstra, C. Muhl, M. Soleymani, J.-S. Lee, A. Yazdani, T. Ebrahimi, Media, 2012.
T. Pun, A. Nijholt, and I. Patras, “Deap: A database for emotion analysis; [70] R. E. Thomson and W. J. Emery, Data analysis methods in physical
using physiological signals,” IEEE Transactions on Affective Computing, oceanography. Newnes, 2014.
vol. 3, no. 1, pp. 18–31, 2012. [71] G. Fortino and V. Giampà, “PPG-based methods for non invasive and
[45] J. A. Healey and R. W. Picard, “Detecting stress during real-world driv- continuous blood pressure measurement: an overview and development
ing tasks using physiological sensors,” IEEE Transactions on intelligent issues in body sensor networks,” in Medical Measurements and Appli-
transportation systems, vol. 6, no. 2, pp. 156–166, 2005. cations Proceedings (MeMeA), IEEE International Workshop, 2010.

1558-1748 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Potrebbero piacerti anche