Recognizing Emotions in the Singing Voice
Klaus R. Scherer, Stéphanie Trznadel, Bernardino Fantini, and Johan Sundberg
Online First Publication, October 23, 2017. http://dx.doi.org/10.1037/pmu0000193
CITATION
Scherer, K. R., Trznadel, S., Fantini, B., & Sundberg, J. (2017, October 23). Recognizing Emotions in
the Singing Voice. Psychomusicology: Music, Mind, and Brain. Advance online publication.
http://dx.doi.org/10.1037/pmu0000193
Psychomusicology: Music, Mind, and Brain 2017 American Psychological Association
2017, Vol. 1, No. 2, 000 0275-3987/17/$12.00 http://dx.doi.org/10.1037/pmu0000193
Although the human ability to recognize emotions in vocal speech utterances with reasonable accuracy has been well documented in numerous studies, little research has been reported on emotion recognition from emotional expression in the singing voice. This paper is the first to examine this issue by asking internationally known professional opera singers to portray 9 major emotions by singing sequences of nonsense syllables on the standard musical scale. We then asked more than 500 listener/judges from different cultures with a wide range of musical preferences and degree of musical knowledge to recognize the intended emotions from the voice recordings. The data show that listeners are indeed able to recognize emotions expressed in singing with better-than-chance accuracy. In addition, we find some evidence that there seem to be only minor effects of culture or language on the ability to recognize the emotional interpretations. Some emotions are more easily recognized than others are. Overall, recognition ability from the singing voice compares well to accuracy rates in studies using speaking. Judges clearly use the differential acoustic patterns of sound generated by the singers in their performance to infer the emotion expressed, as demonstrated by comparing the recognition rates for different emotions to results of statistical classification based on acoustic parameters. We also attempt to explore the nature of the inference process by examining, using path models, the major acoustic variables involved and the inference from subjectively perceived configurations of voice quality.
This article is intended solely for the personal use of the individual user and is not to be disseminated broadly.
This document is copyrighted by the American Psychological Association or one of its allied publishers.
Keywords: emotion recognition, emotion expression in singing, music and emotion, singing voice
It has been suggested that language and music have coevolved from primitive affect bursts, with nonverbal singing possibly preceding speech (Brown, 2000; Mithen, 2005; Scherer, 1991, 2013a, 2013b). This seems like a reasonable hypothesis, given that evolutionarily, the advantage of emotional expression is that it allows better understanding of the emotional reactions of others and thus helps shape one's reactions appropriately for the situation. Consequently, it is of great interest to examine (a) how well emotions can
Klaus R. Scherer, Department of Psychology, University of Geneva; Stéphanie Trznadel, Center for Affective Sciences, University of Geneva; Bernardino Fantini, Faculty of Medicine, University of Geneva; Johan Sundberg, Department of Speech Music Hearing, School of Computer Science and Communication, KTH Stockholm.
KLAUS R. SCHERER, Founding Director of the Swiss Center for Affective Sciences, is Emeritus Professor at the Department of Psychology, University of Geneva and Honorary Professor at the Department of Psychology, University of Munich. He has developed an appraisal theory of emotion and conducted many empirical investigations on the theoretical predictions and on vocal and musical expression.
STÉPHANIE TRZNADEL obtained a master's degree in neurosciences at the University of Geneva and participates in ongoing research at the Swiss Center for Affective Sciences.
BERNARDINO FANTINI is an Emeritus Professor at the University of Geneva where he headed the Institute for the History of Medicine. He writes on issues of musical and aesthetic emotions and directs a classical music festival in Geneva.
JOHAN SUNDBERG is an Emeritus Professor at the Department of Speech Music Hearing, School of Computer Science and Communication, KTH Stockholm. He is a leading expert on the acoustics of the singing voice.
The work reported here is original and has not been published elsewhere. Some selected results have been presented as illustrations in Klaus R. Scherer's keynote speeches at scientific meetings.
The work reported here was conducted by members of the Music and Emotion Focus of the Swiss Center for Affective Sciences (Klaus R. Scherer, Bernardino Fantini, Eduardo Coutinho, and their collaborators). The research was funded by an ERC Advanced Grant in the European Community's 7th Framework Programme under grant agreement 230331-PROPEREMO (Production and perception of emotion: an affective sciences approach) to Klaus R. Scherer and by the National Center of Competence in Research (NCCR) Affective Sciences financed by the Swiss National Science Foundation (51NF40-104897) and hosted by the University of Geneva. We thank the opera singers for their collaboration and Lucas Tamarit for help with the recording set-up. We also acknowledge precious support from Annett Schirmer at the National University of Singapore and Jamin Halberstadt at the University of Otago in New Zealand.
Correspondence concerning this article should be addressed to Klaus R. Scherer, Department of Psychology, University of Geneva, Boulevard du Pont-d'Arve, 40, CH-1211 Geneva, Switzerland. E-mail: klaus.scherer@unige.ch
2 SCHERER, TRZNADEL, FANTINI, AND SUNDBERG
be inferred from vocal utterances, (b) whether similar expression production patterns can be found in the speaking and the singing voice, and (c) whether emotions are equally well-recognized from the singing and the speaking voice. Research in this tradition, focused almost exclusively on the speaking voice, has a long history and has produced a large body of empirical results (Coutinho, Scherer, & Dibben, 2014; Juslin & Laukka, 2003; Juslin & Scherer, 2005; Pell & Kotz, 2011; Scherer, 1995, 2003; Scherer, Johnstone, & Klasmeyer, 2003). In most of these studies, actors have been asked to portray a number of different emotions by producing speech utterances with standardized or nonsense content. Groups of listeners are asked to recognize the portrayed emotions. They are generally required to indicate the perceived emotion on rating sheets with standard lists of emotion labels,
try to separate the role of the lyrics and the music from purely vocal aspects of expression, and examine the extent to which listeners are able to identify a singer's expressive intention to communicate specific emotions by voice quality alone and how this compares to emotion recognition from the speaking voice. We were able to recruit several world-class opera singers to vocally portray different emotions by singing a series of nonsense syllables and schwa sounds ([ə]), using the normal musical scale as a carrier. These expressive recitations were comprehensively analyzed for the underlying acoustic structures by using advanced extraction techniques for a standard parameter set (see Scherer et al., in press for a detailed report on the acoustic analyses).
We used the high-quality recordings in this singing emotion corpus to conduct a number of judgment studies. The recordings
Emotion Differences
As in similar studies on emotion recognition from actor portrayals in speech samples, we were interested in determining not only to what extent the accuracy of recognition would exceed chance levels but also whether certain emotions would be more accurately recognized than others. A central question concerns the nature of the most frequent confusions because these patterns can point to commonalities in the underlying acoustic profiles for specific emotions. In this context, it is of particular interest to compare the recognition level and the confusion matrix with the findings in studies on emotion recognition from speech. Another approach to examine the role of the acoustic concomitants of emotion expression and recognition in singing is to attempt to automatically
Laukka, 2001) were among the first to use this model to study cue utilization in emotion communication in music performances and for the encoding and decoding of vocal emotions. In an early study on the expression and perception of personality in the speaking voice, Scherer (1978) proposed and tested an extension of the lens model in which the cue domain is separated into (a) distal, objectively measurable cues (such as acoustic voice parameters for the speaker) and (b) subjective, proximal percepts of these cues (such as voice quality impressions formed by the listener). The major justification for this extension is that in perception and communication, the objectively measurable cues in vocal behavior are subject to a transmission process from sender to receiver (which often adds noise) and need to be processed and adequately trans-
Figure 1. Graphic illustration of the tripartite emotion expression and perception (TEEP) model.
speech/voice signal, and the third and last step the inferential utilization and emergence of an emotion attribution.
In this article, we report the results of this work on perception/recognition of emotion in the singing voice. The major questions examined are as follows:
(1) Are listeners able to recognize the intended expressive target with better-than-chance accuracy? Are there individual differences between listeners? Does culture or language affect the ability to recognize the emotional interpretations?
(2) Are some emotions more easily recognized than others? How do the recognition results compare with accuracy
ware (Eyben et al., 2015) for the automatic extraction of an extended parameter set from the recordings in the singing corpus described above. Details of the acoustic analyses are provided in Scherer et al. (in press), describing the production aspect of the study, and in the supplemental material to this article. In the present work, we refer to the data reported in that article, in particular six major acoustic scales based on principal component analyses: Loudness (different indicators of high vocal intensity or energy), Dynamics (tempo, mean perturbation, and steep rise/fall slopes for F0 and loudness), Perturbation Variation (variability of jitter, shimmer, and harmonic-to-noise ratio as measured by the coefficient of variation), Low Frequency Energy (proportion of energy in lower ranges of the spectral distribution compared to
To accommodate such a large number of judges, we had to conduct the judgment study in the form of a web application. Although the findings are expected to be very stable, given the large N, there are limitations with respect to the comparability of the conditions under which participants listened to the stimuli. In addition, there are limitations related to the scarcity of personal information; because of the need for complete anonymity, we asked only for age, gender, and whether the person was a student or not (and for some student samples, nationality and language). Despite these limitations, the data reported under Results allow us to discuss preliminary answers to the questions posed at the end of the introduction.
Design of the recognition test. We decided to use only the nonsense-syllable sentence for the recognition test, as it seems closer to the kind of singing listeners are used to rather than the simple scale using schwa [ə] sounds. Each emotion was to be represented by vocalizations from each of the singers. We decided not to select the stimuli to be included in the testing set on the basis of the presumed quality of the productions by the different singers. The reason is that emotional interpretation in singing is determined by the intuition and intention of the individual singer, which are subjective by definition. In addition, there is no body of expertise that could serve as objective criteria for an expert panel to make such a selection. Singers were given the opportunity to repeat the recordings of individual emotions until they were satisfied with their interpretation. However, there is undoubtedly some variation in the expressivity of the different singing samples, especially as the singers sometimes mentioned that they found it easier to portray certain emotions rather than others.
As the first three singers recorded did not interpret three of the emotions on the final list, the total number of items for the test was 63. Because of the large number of stimuli, we divided them into two lists, and participants received one of the two lists. Three stimuli were included in both List 1 and 2 to determine potential group differences. The list presented to each participant was counterbalanced every time a new participant clicked on the link to start the survey.
Procedure. Participants were provided with a link to a server that automatically started the recognition test described above in their own language. Participants in the web volunteer sample could choose their preferred language (English, French, or German), but in all other cases, the language was imposed. Participants were free to complete the test from their home or the university, and they could choose to use either headphones or speakers to listen to the audio stimuli. Before starting the task, participants were instructed that they would hear eight internationally renowned opera singers expressing nine different emotions through meaningless sound sequences and that they would choose, for each vocalization, the emotion word that they considered closest to what the singer intended to express. The buttons for the nine response options (anger, despair, fear, joy, love, pride, sadness, serenity, and tenderness) were arranged in three columns of three response buttons, and participants had to click on the appropriate button. The presentation order of the labels on the response buttons remained the same throughout the experiment. If participants were uncertain about their response, they had the option to replay the stimulus. After adjusting the audio settings to comfortable listening conditions, participants were presented with three example stimuli to familiarize themselves with the task. After the three example trials, they listened to one of the two lists with 33 audio stimuli presented in random order and chose what they considered the most appropriate emotion label by clicking on the corresponding button. At the end of the test, participants received feedback with their overall score for the task.
Voice Quality Ratings
Ratings for the proximal voice cues of the opera singers' vocalizations were collected from an additional group of participants in order to investigate the effects of voice quality on emotion perception.
Participants. Nineteen individuals (63% females, 5.3% over 40 years) were recruited through advertisements at the University of Geneva to participate in exchange for a small remuneration.
Procedure. The ratings were collected in a single session in a classroom at the University of Geneva, where all participants were invited to come at the same time. Audio recordings were played on a laptop computer connected to the classroom loudspeakers, and the volume was adjusted to a comfortable level. Each participant was given eight (one per singer) paper copies of the French version of the Geneva Voice Perception Scales (GVPS; Bänziger, Patel, & Scherer, 2014). The GVPS consists of eight linear and continuous two-dimensional scales ranging from one end to the other, representing eight different characteristics of the voice: pitch, loudness, modulation of intonation, speech rate, articulation, (in)stability, roughness, and sharpness. After receiving instructions about the experimental task, participants were presented with 16 example audio recordings illustrating each end of the eight scales (low pitch, high pitch, low volume, high volume, etc.) to ensure that they understood all the labels on the scales. The example stimuli consisted of a male voice uttering the sentence "I cannot believe it" (in French: "Je ne peux pas le croire"), emphasizing both extremes of each characteristic. After hearing the example stimuli, participants were presented with the recordings of the opera singers, which they had to rate using the GVPS. For each recording, participants were asked to mark the position of the voice on all eight scales. Each recording was replayed repeatedly while it was rated so raters could base their judgment on continuous exposure to the respective voice. When all participants had finished, the next recording was presented in the same fashion. The recordings were presented one singer after the other; that is, we played all the recordings for one singer before moving on to the next singer, and the order of presentation was the same for each singer. We did not randomize the order of singers and of recordings because (1) the vocal range (F0 range, tessitura) of the different voice types (from counter tenor to bass-baritone) is extremely different and, as F0 strongly influences voice quality judgment, we preferred listeners to focus on the differences within a certain range and avoid sudden jumps between stimuli; and (2) randomization of stimulus presentation introduces random effects that will even out for a large number of stimuli but may adversely affect judgments of a restrained number of stimuli. Participants provided 63 × 8 = 504 ratings. After the experimental session, ratings were transferred from the analog scales on the paper questionnaires to numbers, by dividing the scales into five segments of equal length. Thus, each analog rating received a score ranging from 1 (left end of the scale) to 5 (right end of the scale). Interrater reliability was calculated by
using Cronbach's α coefficient, showing very high reliability of the ratings (Cronbach's α = .926 for the whole sample).
Results
Data Analysis
All participants who finished the test were included in the analysis. One-way analyses of variance (ANOVAs) for each of the three stimuli presented in both lists revealed no significant differences, suggesting that the two subgroups who heard different stimulus sets can be combined in the same analyses.
Maori group at the lower end and the U.S. panel and the web volunteers at the high end (Table S2a). Age differences, F(3, 550) = 4.05, p = .007, η² = .022, showed the oldest group (older than 60 years) to have significantly lower accuracy levels based on the post hoc test (Table S2b; but note that the effect size is low, and only 4.1% of the participants are in this class). A significant but weak Gender × Student status interaction effect emerged, F(1, 550) = 6.07, p = .014, η² = .011, showing female students to be more accurate judges than their male counterparts, whereas there was little difference for nonstudent participants. A multivariate ANOVA for an 8 × 2 × 4 × 2 (Group × Gender × Age × Student status) comparison did not show sizable significant differences (see Table S3 in supplemental material).
Table 1
Accuracy of Judgement in Percentage by Participant Groups and Emotions (Theoretical Chance Level = 11.1%)
Group                                     Anger  Contempt  Fear  Joy   Love  Pride  Sadness  Serenity  Tenderness  Total
Web volunteers (EN, FR, GE)               49.3   20.9      44.8  34.6  13.1  41.4   39.8     21.6      28.7        34.1
U.S. Survey EN                            46.9   21.5      44.5  31.7  15.6  42.0   39.0     27.9      28.2        34.8
Uni Geneva FR                             44.4   22.2      37.3  23.0  10.4  30.8   39.6     22.2      31.3        30.0
Uni NZ Otago European EN                  42.9   14.1      33.4  26.4  11.1  34.3   44.1     17.6      20.2        29.3
Uni NZ Otago Maori EN                     27.7   2.5       35.1  36.6  12.8  27.3   34.6     22.5      19.5        25.7
Uni Singapore Chinese EN                  49.5   17.5      30.1  17.4  13.0  28.8   32.4     26.1      18.0        27.3
Uni Singapore Chinese CH                  47.0   14.2      36.4  22.0  18.4  33.7   38.8     22.5      23.9        29.5
Chance level corrected for response bias  9.4    10.4      8.4   10.0  9.5   14.2   14.9     11.6      11.5
Note. Uni = recruitment via university postings; ethnic origin: European = New Zealanders with European descent, Maori = New Zealanders with Maori descent, Chinese = Singaporeans with Chinese descent; languages: CH = Chinese; EN = English; FR = French; GE = German.
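The theoretical chance level in Table 1 and the stimulus counts given in the Method section follow from simple arithmetic, sketched below as a consistency check. The constant and function names are hypothetical, and two points are assumptions rather than statements from the article: that the missing recordings amount to three singers times three emotions, and that the stimuli not shared between the two lists were split evenly.

```python
N_SINGERS = 8      # opera singers recorded
N_EMOTIONS = 9     # response options offered to listeners
N_MISSING = 3 * 3  # assumption: three singers each lack three emotions
N_SHARED = 3       # stimuli included in both List 1 and List 2
N_SCALES = 8       # voice quality scales rated per recording


def total_stimuli():
    """Recordings available for the recognition test."""
    return N_SINGERS * N_EMOTIONS - N_MISSING


def stimuli_per_list():
    """Stimuli heard by one participant.

    Assumption: the non-shared stimuli are divided evenly.
    """
    unique = total_stimuli() - N_SHARED
    return unique // 2 + N_SHARED


def theoretical_chance_percent():
    """Chance level when all response options are equally likely."""
    return 100.0 / N_EMOTIONS


def voice_quality_ratings_per_rater():
    """One rating per recording and scale."""
    return total_stimuli() * N_SCALES


if __name__ == "__main__":
    print(total_stimuli(), stimuli_per_list())
    print(round(theoretical_chance_percent(), 1))
    print(voice_quality_ratings_per_rater())
```

Under these assumptions the sketch reproduces the figures stated in the text: 63 stimuli in total, 33 per list, a chance level of 11.1%, and 504 voice quality ratings per rater.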
Table 2
Confusion Matrix of Emotion Judgements vs. Correct Answers for All Groups of Judges Combined (in %), Correct Targets Compared With the Recognition Levels for Actor Emotion Portrayals in Nonsense Sentences
Judgment                                  Anger      Contempt   Fear       Joy        Love       Pride      Sadness    Serenity   Tenderness
Anger                                     48.6 (59)  12.9       4.4        5.4        2.7        8.8        0.3        0.5        0.5
Contempt                                  21.8       19.0 (24)  5.7        10.7       10.1       15.0       2.5        4.3        4.3
Fear                                      4.3        2.8        43.2 (59)  5.5        4.6        3.9        6.8        2.0        2.8
Joy                                       7.4        9.0        8.2        30.1 (28)  10.0       15.3       2.3        4.1        4.1
Love                                      0.8        8.2        5.7        6.2        13.8 (–)   6.7        10.6       15.3       18.5
Pride                                     14.3       28.3       4.2        19.5       14.8       37.3 (16)  1.5        4.6        3.6
Sadness                                   1.3        3.5        16.7       11.9       16.4       4.9        39.2 (18)  21.7       18.5
Serenity                                  1.5        10.0       5.3        5.3        16.3       5.1        15.8       24.4 (–)   21.1
Tenderness                                0.1        6.4        6.5        5.4        11.3       2.9        21.0       23.1       26.6 (23)
Chance level corrected for response bias  9.4        10.4       8.4        10.0       9.5        14.2       14.9       11.6       11.5
Note. Average accuracy of singer judgments across emotions: 33.1%; (xx) = accuracy percentages achieved for vocal channel of actor speech portrayals of emotions (from column A under Core set in Table 5 in Bänziger et al., 2012), average accuracy 32.4%; numbers in italics 10%; Boldface indicates percentage of correct judgments (hit rate).
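The article does not spell out how the chance level was corrected for response bias. One plausible reading, sketched here as an assumption, is that the corrected chance level for a label is the average rate at which judges chose that label across all nine intended emotions. The sketch reads the rows of Table 2 as the chosen label and the columns as the intended emotion, an orientation inferred from the fact that each column sums to about 100%. All names in the code are hypothetical.

```python
EMOTIONS = ["anger", "contempt", "fear", "joy", "love",
            "pride", "sadness", "serenity", "tenderness"]

# Table 2 values in %. Assumed orientation: row = label chosen by the
# judges, column = emotion intended by the singer.
CONFUSION = [
    [48.6, 12.9, 4.4, 5.4, 2.7, 8.8, 0.3, 0.5, 0.5],
    [21.8, 19.0, 5.7, 10.7, 10.1, 15.0, 2.5, 4.3, 4.3],
    [4.3, 2.8, 43.2, 5.5, 4.6, 3.9, 6.8, 2.0, 2.8],
    [7.4, 9.0, 8.2, 30.1, 10.0, 15.3, 2.3, 4.1, 4.1],
    [0.8, 8.2, 5.7, 6.2, 13.8, 6.7, 10.6, 15.3, 18.5],
    [14.3, 28.3, 4.2, 19.5, 14.8, 37.3, 1.5, 4.6, 3.6],
    [1.3, 3.5, 16.7, 11.9, 16.4, 4.9, 39.2, 21.7, 18.5],
    [1.5, 10.0, 5.3, 5.3, 16.3, 5.1, 15.8, 24.4, 21.1],
    [0.1, 6.4, 6.5, 5.4, 11.3, 2.9, 21.0, 23.1, 26.6],
]

# Last row of Table 2 and the speech values given in parentheses
# (None where the table shows no value).
PUBLISHED_CHANCE = [9.4, 10.4, 8.4, 10.0, 9.5, 14.2, 14.9, 11.6, 11.5]
SPEECH_HIT_RATES = [59, 24, 59, 28, None, 16, 18, None, 23]


def column_sums(matrix):
    """Sum of judgments for each intended emotion (about 100%)."""
    return [sum(row[j] for row in matrix) for j in range(len(matrix))]


def label_usage_chance(matrix):
    """Mean rate at which each label was chosen (assumed correction)."""
    n = len(matrix)
    return [sum(row) / n for row in matrix]


def hit_rates(matrix):
    """Diagonal of the confusion matrix."""
    return [matrix[i][i] for i in range(len(matrix))]


def mean_available(values):
    """Mean of the entries that are not None."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present)


if __name__ == "__main__":
    rows = zip(EMOTIONS, hit_rates(CONFUSION),
               label_usage_chance(CONFUSION), PUBLISHED_CHANCE)
    for name, hit, chance, published in rows:
        print(f"{name:<11} hit {hit:5.1f}  chance {chance:5.2f}  "
              f"published {published:5.1f}")
```

Under this reading, the computed values stay within about 0.1 percentage points of the published row, and the mean of the seven available speech values matches the 32.4% given in the table note. The agreement makes the interpretation plausible but does not confirm that the authors computed the correction this way.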
In contrast, the hit rates for contempt, love, and serenity were close to chance largely because of frequent confusions with other emotions. This impression was confirmed by an ANOVA, F(8, 62) = 2.575, p = .019, η² = .276, with post hoc analysis for emotion differences that placed these three emotions together at the very low end of Subset 1 (i.e., maximally different from anger at the high end of Subset 2; see Table S4 in supplemental material).
How does this compare with the accuracy percentages found for actors' portrayals of emotion in the speaking voice? Scherer et al. (2011) reviewed the empirical research findings on emotion decoding from vocal expression in speech and reported the following mean accuracies for the major emotions: anger 74.9%, fear 62.4%, joy/happiness 54.0%, and sadness 74.9%. However, these accuracy percentages are not directly comparable with the data presented here, as in many studies in the literature only a few maximally different basic emotions have been studied, which reduces the probability of confusions between emotions with similar valence or arousal levels and encourages the use of simple exclusion rules rather than direct recognition. Bänziger et al. (2012), who used the core set (about 150 actor expressions for 18 emotions) of the Geneva Multimodal Emotion Portrayals corpus to conduct a recognition study, have reported results that are more comparable to the present data. Comparability is even more assured by the fact that the same nonsense-syllable utterance was used in both studies, either spoken or sung. The results of the ratings of the audio channel of these clips only are also shown in Table 2 (shown in parentheses next to the hit rates). The respective values are not directly comparable, as the raters in the Bänziger et al. (2012) study had to choose among 18 alternatives, which reduces the chance level by half and is likely to increase confusions. On the other hand, the clips in the core set had been chosen on the basis of emotion ratings of a much larger (1,000 items) corpus of expressions, a preselection that should reduce the number of confusions and increase the likelihood of correct recognition. The comparison of the values in the diagonal of the confusion matrix in Table 2 suggests that judges were better able to recognize anger and fear from the speaking compared with the singing voice, whereas sadness and pride were better recognized in the singing voice (however, as these two response categories were used more frequently than others, the chance levels were corrected upward). Joy and tenderness attained a similar level.
Given the low recognition rate for serenity, love, and contempt, in the remainder of this article we focus on the six more established emotions with more satisfactory emotion communication potential in the singing voice. Recomputing the average recognition rate using only the accuracy percentages for the six best emotions yields an accuracy percentage of 37.0%. It is interesting to compare this accuracy percentage for human judges with the accuracy of an automatic emotion classification/recognition based on acoustic parameters. As described in the Method section, the vocal stimuli discussed here have been extensively analyzed with the Geneva Minimalistic Acoustic Parameter Set acoustic parameter extraction tool (Eyben et al., 2016). The results are reported in a separate paper (Scherer et al., in press) in which a multiple discriminant analysis with five major acoustic factors has been computed. The resulting confusion matrix of the classification is reproduced in Table 3 to allow comparison with the human recognition patterns reported above. The overall accuracy of the automatic classification is 43.5%, compared with 37.0% found for the human judges. A close inspection of the confusion matrix in Table 3 shows that this difference is largely due to the relatively lower accuracy percentages in the human judgments for sadness and tenderness on the one hand and pride on the other. The reason for this is probably that the human judges had nine categories to choose from and the discriminant analysis only six. Given that in the original judgment study there were strong confusions of both sadness and tenderness with serenity and love and of pride with contempt (see Table 2), the overall accuracy is likely to be higher for the human judges than for the machine classification. This assumption is confirmed by the fact that some of the machine confusions are much larger and more inappropriate than the corresponding human confusions, for example, in the case of joy being classified as anger or the mutual confusions of tenderness and fear.
Modeling the Recognition Mechanism
Bänziger, Hosoya, and Scherer (2015) successfully used the TEEP model to analyze the complete process of the encoding of
Table 3
Confusion Matrix (% Accuracy) for Machine Classification of Singers' Emotion Portrayals in Comparison With Human Judgment
emotions by actors to the decoding by naive raters (judging the intended emotions), using structural equation modeling to parametrize the model and test the goodness of fit. As structural equation modeling (and to some extent also hierarchical regression procedures) requires a large number of observations to yield reliable results, we could not use these techniques, as we had only a relatively small number of observations (48 singing samples, restricting the analysis to the major six emotions). In consequence, we use descriptive graphs with Pearson correlations between the elements of the model to visualize potential models and develop hypotheses for future research. Figures 2 and 3 illustrate this procedure for anger and sadness.
For the material reported below, we adopted dichotomous variables for the expressed emotions (e.g., anger expressed = 1 and other emotions expressed = 0) to allow correlational analysis. Figure 2 shows the TEEP model for anger, which is, as shown in Table 1, the best recognized of the singers' emotion interpretations and surpasses the accuracy level achieved by multiple discriminant analysis (see Table 3). The correlation between anger expressed and anger inferred is r = .86 (p < .01). The TEEP model illustrates the mechanism that produced this excellent result: On the distal side, the acoustic measurements of high loudness, high dynamics (rate, F0 contour, loudness variation) and weak low frequency energy correlate with singer portrayals of anger. These acoustic characteristics are correctly perceived by human judges (as high volume, variable intonation, rapid rate/tempo, and low vocal instability), indicating that the appropriate cues for the inference of the underlying emotion are available. We assume that the ratings of instability reflect the difference between adjacent F0 periods, regardless of whether it is caused by random variation or nonrandom variation such as vibrato (at least in classical singing, random variation would mostly be interpreted as a sign of poor vocal
Figure 2. Tripartite emotion expression and perception model of anger inference from the singing voice.
Figure 3. Tripartite emotion expression and perception model of sadness inference from the singing voice.
control). On the proximal side, the use of these cues by the judges inferring the emotions interpreted by the singers corresponds exactly to relationships on the distal side. In other words, anger portrayals have clear acoustic concomitants, the respective acoustic parameters are correctly perceived, and the inference rules mirror the distal expression patterns. However, this appropriate configuration is not sufficient to obtain a recognition rate higher than 50% because of the confusions with contempt and pride. Although the confusion with contempt may be explained by the fact that contempt is sometimes blended with anger, the confusion with pride is clearly due to the lack of acoustic parameters in the model that capture the central difference in positive and negative valence between anger and pride. It seems that vocal expression lacks powerful cues for valence discrimination (especially in comparison with facial expression where smiling [zygomaticus activity], with or without accompanying speech activity, is a powerful and ubiquitous signal for positive valence; Matsumoto, Keltner, Shiota, Frank, & O'Sullivan, 2008; Scherer et al., 2011). Because the 0/1 coding of anger expression (other emotion/anger) as a dichotomous variable correlates at r = .86 with the percentage of judges having chosen anger to label the respective interpretations, at least some weak valence cues may be available but are not measured by our current parameter set.
The case of sadness inference, illustrated by the TEEP model shown in Figure 3, is somewhat different. The correlation between sadness expressed and sadness inferred is r = .58 (p < .01). Here the frequent confusions with tenderness, serenity/calm, and love (shown in Table 2) seem to be due to a major discrepancy between the distal expression and the proximal inference: Judges use cues of a low level of dynamics as a general cue for sadness, tenderness, serenity/calm, and love. However, on the distal side, the correlation between sadness expressed and the acoustic dynamics component, while indeed negative, is rather low and nonsignificant. Thus, although judges correctly perceive low loudness and slow rate/tempo in sadness portrayals, they may overgeneralize the covariation of these two parameters with low dynamics (as is suggested by the high correlations between acoustic dynamics and judged volume, intonation, and rate in the transition part of the model). Figure 4 illustrates the case of joy. The correlation between joy expressed and joy inferred is r = .58 (p < .01). Here, the lower recognition rate is probably due to judges using the proximal cue of variable intonation (produced by the dynamics factor) to infer joy. However, there is no strong distal relationship between joy as expressed by the singers and a high level of dynamics.
The discussion of Figures 2 to 4 was meant to provide an example of how, with the help of the TEEP model, one might
Figure 4. Tripartite emotion expression and perception model of joy inference from the singing voice.
arrive at a better understanding of the process of vocal communication of emotion in singing, as well as the probability of successfully conveying an emotional interpretation of a character in a particular situation to the listener.

Discussion and Conclusion

We have reported what we believe to be the first systematic empirical investigation of the extent to which nonprofessional listeners can recognize emotional interpretations of musical material by singers. In designing the study, we made the choice to use highly professional opera singers to ensure a sufficient degree of expertise and experience of encoding a wide variety of emotional expressions, recorded under studio conditions, to obtain stimulus [...]

[...] recognized in the singing voice. Pride, which is rarely studied, also achieves a remarkable accuracy score. In contrast, contempt, love, and serenity, which are not infrequently encountered in operas, are relatively close to chance level. As to contempt, this emotion is also frequently confused with anger in spoken expressions. As to love and serenity, the vocal expressions of the emotions are close to those of sadness and tenderness (low vocal energy and slow tempo), and thus frequent confusions occur with the latter two emotions. It seems likely that sadness and tenderness are penalized by these confusions and would probably obtain better accuracy if love and serenity were not provided as potential alternatives.

(2) How does the recognition ability compare with accuracy [...]
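Hit rates, chance level, and typical confusions of the kind discussed in this section can be tabulated from a matrix of intended versus judged emotions. The following is only a minimal sketch under stated assumptions: the counts are hypothetical and are not data from this study, the function name `recognition_summary` is introduced here for illustration, and chance level is assumed to be uniform guessing among the listed labels.

```python
from typing import Dict, Optional

# HYPOTHETICAL judgment counts (rows: intended emotion, columns: judged emotion).
# These numbers are invented for illustration and are NOT data from the study.
HYPOTHETICAL_COUNTS: Dict[str, Dict[str, int]] = {
    "sadness":    {"sadness": 50, "tenderness": 35, "anger": 15},
    "tenderness": {"sadness": 40, "tenderness": 45, "anger": 15},
    "anger":      {"sadness": 4,  "tenderness": 6,  "anger": 90},
}


def recognition_summary(confusion: Dict[str, Dict[str, int]]) -> Dict[str, dict]:
    """Return hit rate, chance level, and most frequent confusion per intended emotion."""
    labels = list(confusion)
    # Assumption: forced choice among all listed labels with uniform guessing.
    chance = 1.0 / len(labels)
    summary: Dict[str, dict] = {}
    for intended, row in confusion.items():
        total = sum(row.values())
        # Proportion of judgments that match the intended emotion.
        hit_rate = row.get(intended, 0) / total if total else 0.0
        # Judgments that landed on any other label.
        errors = {judged: n for judged, n in row.items() if judged != intended}
        top_confusion: Optional[str] = max(errors, key=errors.get) if errors else None
        summary[intended] = {
            "hit_rate": hit_rate,
            "chance": chance,
            "above_chance": hit_rate > chance,
            "top_confusion": top_confusion,
        }
    return summary


if __name__ == "__main__":
    for emotion, stats in recognition_summary(HYPOTHETICAL_COUNTS).items():
        print(emotion, stats)
```

In this toy matrix, sadness and tenderness are each other's most frequent confusion. This is the pattern by which removing a close alternative from the response set would be expected to raise the hit rate of the remaining category.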
tenderness because of the reliance on similar acoustic parameters (low vocal energy and slow tempo; Figure 3). Similarly, the classifier was more accurate for pride, as it did not have the choice of contempt, which was often confused with pride by the human judges (see Table 2). Humans also confused pride with anger more frequently than was the case for automatic classification based on acoustic cues. Interestingly, humans were slightly better in recognizing joy, in part because they rarely confused it with anger as the classifier was prone to do (because both anger and joy were characterized by high vocal energy and fast tempo; Figures 2 and 4). Apparently, human judges have access to some cues that distinguish joy and anger with respect to positive and negative valence (despite the fact that the voice is not ideally suited to express valence). However, none of the acoustic parameters fed into the automatic classification seem to provide this information. This reflects a common problem frequently reported in the literature: the difficulty in finding valid vocal cues for valence (as compared with power [vocal energy] and arousal [tempo]). The fact that human judges apparently use valence cues encourages further attempts to identify valence cues in work on vocal expression.

(3) What is the nature of the process involved in inferring emotions from a singing sample? What are the major acoustic variables involved, and how is their perception mediated by perceived configurations of voice quality?

The nature of the expression and perception/inference processes underlying emotion recognition in the singing voice was the object of the third question investigated in this article. We used the TEEP model, inspired by Brunswik's lens model, to illustrate how the type of data presented here can be used to investigate these processes. However, we could only illustrate the theoretical model with correlation coefficients rather than fitting an exact statistical model because of the relatively low number of observations resulting from using experienced internationally known opera singers. The three cases discussed show several examples of highly functional communication processes (allowing high accuracy) in which valid distal cues were correctly transmitted to the listener as proximal cues, and the latter were used in an appropriate fashion for the inference (mirroring the distal relationship). The dysfunctional links identified in Figures 2 to 4 concern mostly the faulty interpretation of proximal cues that do not have sufficient validity as distal indicators of the underlying emotional expression. Other possible dysfunctions include faulty transmission of distal cues to the proximal side. We strongly believe that the systematic use of the TEEP model in cases in which a sufficient number of appropriate observations is available should allow the establishment of the goodness of fit of theoretically predicted expression-inference relationships with techniques such as hierarchical regression or structural equation modeling (Bänziger et al., 2015).

In conclusion, we hope to have shown the utility and feasibility of studying emotional communication in the singing voice, a largely neglected area of research. The results of further studies in this domain would not only enrich our general knowledge about emotion communication in the vocal channel, but would also help researchers empirically study issues in the psychology of music that have so far eluded empirical scrutiny. One example concerns the degree to which singers can successfully produce authentic emotional expression on the stage (see Scherer, 2013c) and how this expression potentially differs from professional mimicking that does not involve any emotional participation. The results obtained in our attempt to model the inference mechanism underlying emotion recognition provide important insights and encourage further efforts in the direction of more complex research designs, combining the study of expression and impression (recognition) of emotion in music and particularly singing.

References

Balkwill, L.-L., & Thompson, W. F. (1999). A cross-cultural investigation of the perception of emotion in music: Psychophysical and cultural cues. Music Perception, 17, 43–64. http://dx.doi.org/10.2307/40285811
Balkwill, L.-L., Thompson, W. F., & Matsunaga, R. (2004). Recognition of emotion in Japanese, Western, and Hindustani music by Japanese listeners. Japanese Psychological Research, 46, 337–349. http://dx.doi.org/10.1111/j.1468-5584.2004.00265.x
Bänziger, T., Grandjean, D., & Scherer, K. R. (2009). Emotion recognition from expressions in face, voice, and body: The Multimodal Emotion Recognition Test (MERT). Emotion, 9, 691–704. http://dx.doi.org/10.1037/a0017088
Bänziger, T., Hosoya, G., & Scherer, K. R. (2015). Path models of vocal emotion communication. PLoS ONE, 10, e0136675. http://dx.doi.org/10.1371/journal.pone.0136675
Bänziger, T., Mortillaro, M., & Scherer, K. R. (2012). Introducing the Geneva Multimodal Expression Corpus for experimental research on emotion perception. Emotion, 12, 1161–1179. http://dx.doi.org/10.1037/a0025827
Bänziger, T., Patel, S., & Scherer, K. R. (2014). The role of perceived voice and speech characteristics in vocal emotion communication. Journal of Nonverbal Behavior, 38, 31–52. http://dx.doi.org/10.1007/s10919-013-0165-x
Brown, S. (2000). The "musilanguage" model of music evolution. In N. L. Wallin, B. Merker, & S. Brown (Eds.), The origins of music (pp. 271–300). Cambridge, MA: MIT Press.
Brunswik, E. (1956). Perception and the representative design of psychological experiments. Berkeley, CA: University of California Press.
Cochrane, T., Fantini, B., & Scherer, K. R. (Eds.). (2013). The emotional power of music. Oxford, United Kingdom: Oxford University Press. http://dx.doi.org/10.1093/acprof:oso/9780199654888.001.0001
Coutinho, E., Scherer, K. R., & Dibben, N. (2014). Singing and emotion. In G. Welch, D. M. Howard, & J. Nix (Eds.), The Oxford handbook of singing (pp. 1–19). Oxford, United Kingdom: Oxford University Press.
Eyben, F., Salomão, L. G., Sundberg, J., Scherer, K. R., & Schuller, B. W. (2015). Emotion in the singing voice: A deeper look at acoustic features in the light of automatic classification. EURASIP Journal on Audio, Speech, and Music Processing, 2015. http://dx.doi.org/10.1186/s13636-015-0057-6
Eyben, F., Scherer, K. R., Schuller, B. W., Sundberg, J., André, E., Busso, C., . . . Truong, K. P. (2016). The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for voice research and affective computing. IEEE Transactions on Affective Computing, 7, 190–202. http://dx.doi.org/10.1109/TAFFC.2015.2457417
Foulds-Elliott, S. D., Thorpe, C. W., Cala, S. J., & Davis, P. J. (2000). Respiratory function in operatic singing: Effects of emotional connection. Logopedics, Phoniatrics, Vocology, 25, 151–168. http://dx.doi.org/10.1080/140154300750067539
Fritz, T., Jentschke, S., Gosselin, N., Sammler, D., Peretz, I., Turner, R., . . . Koelsch, S. (2009). Universal recognition of three basic emotions in music. Current Biology, 19, 573–576. http://dx.doi.org/10.1016/j.cub.2009.02.058
Juslin, P. N. (2000). Cue utilization in communication of emotion in music performance: Relating performance to perception. Journal of Experimental Psychology: Human Perception and Performance, 26, 1797–1812. http://dx.doi.org/10.1037/0096-1523.26.6.1797
Juslin, P. N., & Laukka, P. (2001). Impact of intended emotion intensity on cue utilization and decoding accuracy in vocal expression of emotion. Emotion, 1, 381–412. http://dx.doi.org/10.1037/1528-3542.1.4.381
Juslin, P. N., & Laukka, P. (2003). Communication of emotions in vocal expression and music performance: Different channels, same code? Psychological Bulletin, 129, 770–814. http://dx.doi.org/10.1037/0033-2909.129.5.770
Juslin, P. N., & Scherer, K. R. (2005). Vocal expression of affect. In J. A. Harrigan, R. Rosenthal, & K. Scherer (Eds.), The new handbook of methods in nonverbal behavior research (pp. 65–135). Oxford, United Kingdom: Oxford University Press.
Matsumoto, D., Keltner, D., Shiota, M. N., Frank, M. G., & O'Sullivan, M. (2008). What's in a face? Facial expressions as signals of discrete [...]
Scherer, K. R. (2013a). Affect bursts as evolutionary precursors of speech and music. In G. A. Danieli, A. Minelli, & T. Pievani (Eds.), Stephen J. Gould: The scientific legacy (pp. 147–167). Milan, Italy: Springer-Verlag. http://dx.doi.org/10.1007/978-88-470-5424-0_10
Scherer, K. R. (2013b). Emotion in action, interaction, music, and speech. In M. A. Arbib (Ed.), Language, music, and the brain: A mysterious relationship (pp. 107–140). Cambridge, MA: MIT Press. http://dx.doi.org/10.7551/mitpress/9780262018104.003.0005
Scherer, K. R. (2013c). The singer's paradox: On authenticity in emotional expression on the opera stage. In T. Cochrane, B. Fantini, & K. R. Scherer (Eds.), The emotional power of music (pp. 55–73). Oxford, United Kingdom: Oxford University Press. http://dx.doi.org/10.1093/acprof:oso/9780199654888.003.0005
Scherer, K. R., Clark-Polner, E., & Mortillaro, M. (2011). In the eye of the beholder? Universality and cultural specificity in the expression and perception of emotion. International Journal of Psychology, 46, 401