
Psychomusicology: Music, Mind, and Brain
© 2017 American Psychological Association. 0275-3987/17/$12.00
Advance online publication, October 23, 2017. http://dx.doi.org/10.1037/pmu0000193

Recognizing Emotions in the Singing Voice

Klaus R. Scherer, Stéphanie Trznadel, and Bernardino Fantini
University of Geneva

Johan Sundberg
KTH Stockholm

Although the human ability to recognize emotions in vocal speech utterances with reasonable accuracy has been well documented in numerous studies, little research has been reported on emotion recognition from emotional expression in the singing voice. This paper is the first to examine this issue by asking internationally known professional opera singers to portray 9 major emotions by singing sequences of nonsense syllables on the standard musical scale. We then asked more than 500 listener-judges from different cultures, with a wide range of musical preferences and degrees of musical knowledge, to recognize the intended emotions from the voice recordings. The data show that listeners are indeed able to recognize emotions expressed in singing with better-than-chance accuracy. In addition, we find some evidence that there seem to be only minor effects of culture or language on the ability to recognize the emotional interpretations. Some emotions are more easily recognized than others. Overall, recognition ability from the singing voice compares well to accuracy rates in studies using speaking. Judges clearly use the differential acoustic patterns of sound generated by the singers in their performance to infer the emotion expressed, as demonstrated by comparing the recognition rates for different emotions to results of statistical classification based on acoustic parameters. We also attempt to explore the nature of the inference process by examining, using path models, the major acoustic variables involved and the inference from subjectively perceived configurations of voice quality.

Keywords: emotion recognition, emotion expression in singing, music and emotion, singing voice

Supplemental materials: http://dx.doi.org/10.1037/pmu0000193.supp

Klaus R. Scherer, Department of Psychology, University of Geneva; Stéphanie Trznadel, Center for Affective Sciences, University of Geneva; Bernardino Fantini, Faculty of Medicine, University of Geneva; Johan Sundberg, Department of Speech Music Hearing, School of Computer Science and Communication, KTH Stockholm.

KLAUS R. SCHERER, Founding Director of the Swiss Center for Affective Sciences, is Emeritus Professor at the Department of Psychology, University of Geneva, and Honorary Professor at the Department of Psychology, University of Munich. He has developed an appraisal theory of emotion and conducted many empirical investigations on the theoretical predictions and on vocal and musical expression.

STÉPHANIE TRZNADEL obtained a master's degree in neurosciences at the University of Geneva and participates in ongoing research at the Swiss Center for Affective Sciences.

BERNARDINO FANTINI is an Emeritus Professor at the University of Geneva, where he headed the Institute for the History of Medicine. He writes on issues of musical and aesthetic emotions and directs a classical music festival in Geneva.

JOHAN SUNDBERG is an Emeritus Professor at the Department of Speech Music Hearing, School of Computer Science and Communication, KTH Stockholm. He is a leading expert on the acoustics of the singing voice.

The work reported here is original and has not been published elsewhere. Some selected results have been presented as illustrations in Klaus R. Scherer's keynote speeches at scientific meetings.

The work reported here was conducted by members of the Music and Emotion Focus of the Swiss Center for Affective Sciences (Klaus R. Scherer, Bernardino Fantini, Eduardo Coutinho, and their collaborators). The research was funded by an ERC Advanced Grant in the European Community's 7th Framework Programme under grant agreement 230331-PROPEREMO (Production and perception of emotion: an affective sciences approach) to Klaus R. Scherer and by the National Center of Competence in Research (NCCR) Affective Sciences, financed by the Swiss National Science Foundation (51NF40-104897) and hosted by the Swiss Center for Affective Sciences. We thank the opera singers for their collaboration and Lucas Tamarit for help with the recording set-up. We also acknowledge precious support from Annett Schirmer at the National University of Singapore and Jamin Halberstadt at the University of Otago in New Zealand.

Correspondence concerning this article should be addressed to Klaus R. Scherer, Department of Psychology, University of Geneva, Boulevard du Pont-d'Arve 40, CH-1211 Geneva, Switzerland. E-mail: klaus.scherer@unige.ch

It has been suggested that language and music have coevolved from primitive affect bursts, with nonverbal singing possibly preceding speech (Brown, 2000; Mithen, 2005; Scherer, 1991, 2013a, 2013b). This seems like a reasonable hypothesis, given that evolutionarily, the advantage of emotional expression is that it allows better understanding of the emotional reactions of others and thus helps shape one's reactions appropriately for the situation. Consequently, it is of great interest to examine (a) how well emotions can

be inferred from vocal utterances, (b) whether similar expression production patterns can be found in the speaking and the singing voice, and (c) whether emotions are equally well-recognized from the singing and the speaking voice. Research in this tradition, focused almost exclusively on the speaking voice, has a long history and has produced a large body of empirical results (Coutinho, Scherer, & Dibben, 2014; Juslin & Laukka, 2003; Juslin & Scherer, 2005; Pell & Kotz, 2011; Scherer, 1995, 2003; Scherer, Johnstone, & Klasmeyer, 2003). In most of these studies, actors have been asked to portray a number of different emotions by producing speech utterances with standardized or nonsense content. Groups of listeners are asked to recognize the portrayed emotions. They are generally required to indicate the perceived emotion on rating sheets with standard lists of emotion labels, allowing researchers to compute the percentage of stimuli per emotion that was correctly recognized. Scherer, Clark-Polner, and Mortillaro (2011) have reviewed the major studies in this area, all of which found better-than-chance accuracy in the recognition of vocally expressed emotion. Virtually all of this research has focused on the speaking voice, and there has been little effort to understand the recognition of emotion from the singing voice. This is all the more surprising, as both opera and song performances strongly rely on the singer's ability to convey authentic emotions to the audience, suggesting that listeners can indeed correctly recognize the singer's expressive intentions. In an early study, Sherman (1928) had a singer convey different emotions by repeatedly singing a single note and simple melodic sequences and asked observers to name the emotions intended (surprise, fear-pain, sorrow, and anger-hate). The results showed that both single tones and melodies reliably convey emotional significance to the listener. Acoustic analyses of singers' voices in concert settings, real or imagined, have shown that an intended emotional interpretation produces changes in respiration (Foulds-Elliott, Thorpe, Cala, & Davis, 2000) and specific acoustic patterns (e.g., in intonation; Sundberg, Lã, & Himonides, 2013). Furthermore, recent research on the acoustical parameters underlying the expression of affective meaning in the singing voice has shown that there are indeed specific acoustic patterns for different types of emotions, and has suggested similarities to emotion expression in the speaking voice (Eyben, Salomão, Sundberg, Scherer, & Schuller, 2015; Scherer, Sundberg, Tamarit, & Salomão, 2015). For example, in emotional singing (a) anger is generally expressed by a high level of loudness, steep fundamental frequency (F0) contours, and more spectral energy in the higher frequency range; (b) sadness by a low level of loudness, a small amount of loudness variation, a low degree of perturbation (e.g., jitter and shimmer), and slow tempo; and (c) pride by a high F0 level and high frequency and amplitude of the formants (see Table 2 in Scherer, Sundberg, Fantini, Trznadel, & Eyben, in press).

Emotion Recognition From the Pure Singing Voice

In both opera and song performances, the expression of emotion is inextricably linked to the underlying text, as prescribed by the libretto (text of an opera) or the lyrics of songs (e.g., in Lieder). In addition, the melodies that composers have written for opera arias (airs) or for lyrics in songs also carry substantial emotional information, as attested by the copious work on the emotional power of music (see the contributions in Cochrane, Fantini, & Scherer, 2013). In this study, we try to separate the role of the lyrics and the music from purely vocal aspects of expression, and examine the extent to which listeners are able to identify a singer's expressive intention to communicate specific emotions by voice quality alone and how this compares to emotion recognition from the speaking voice. We were able to recruit several world-class opera singers to vocally portray different emotions by singing a series of nonsense syllables and schwa sounds ([ə]), using the normal musical scale as a carrier. These expressive recitations were comprehensively analyzed for the underlying acoustic structures by using advanced extraction techniques for a standard parameter set (see Scherer et al., in press, for a detailed report on the acoustic analyses).

We used the high-quality recordings in this singing emotion corpus to conduct a number of judgment studies. The recordings were integrated into a web-based emotion recognition test that invited listeners to test how well they could recognize the singers' intentions to portray different emotions from language- and prosody-free sung recitations based on the classic musical scale.

Individual and Cultural Differences in Recognition Ability

The ability to infer emotions from the voice is part of the emotional competence of individuals, a research area which has been rapidly growing in recent years, with a number of formal tests being developed (Bänziger, Grandjean, & Scherer, 2009). In consequence, the nature of individual differences such as age and gender in this ability is of interest, as well as the question of whether musical training is required to decode emotional interpretations in sung material. Therefore, in this study, we used a wide variety of listeners with different types of musical preferences and different degrees of musical knowledge. Finally, the emotion recognition literature has a major stake in trying to understand to what extent cultural and linguistic differences will affect this ability. This is particularly interesting with respect to emotional expression in music and singing, given the intriguing hypothesis that music and speech have coevolved from primitive affect bursts, conferring a special role to singing (Scherer, 2013a, 2013b). It is often held that music is rather culture-specific, given the importance of the differences in musical intervals and other structural aspects. However, empirical research on emotion recognition from music originating in different cultures has shown that listeners are quite able to infer some of the major emotions from music of different cultural origin (e.g., studies on Western, Japanese, and Hindustani groups, Balkwill & Thompson, 1999; Balkwill, Thompson, & Matsunaga, 2004; a native African group, Fritz et al., 2009; and English, German, Hindi, and Arabic groups, Pell, Paulmann, Dara, Alasseri, & Kotz, 2009). These studies show that judges from different cultures tend to be attentive to both culture-specific musical cues and a variety of acoustic cues for certain emotions that may be more universal markers of emotion. In the case of singing, the potential effect of language-specific articulatory and prosodic characteristics complicates the issue. In consequence, we were particularly interested in the extent to which listeners from different cultures, speaking different languages, could recognize our recordings of singers producing emotion portrayals based on the Western musical scale.
Emotion Differences

As in similar studies on emotion recognition from actor portrayals in speech samples, we were interested in determining not only to what extent the accuracy of recognition would exceed chance levels but also whether certain emotions would be more accurately recognized than others. A central question concerns the nature of the most frequent confusions because these patterns can point to commonalities in the underlying acoustic profiles for specific emotions. In this context, it is of particular interest to compare the recognition level and the confusion matrix with the findings in studies on emotion recognition from speech. Another approach to examine the role of the acoustic concomitants of emotion expression and recognition in singing is to attempt to automatically discriminate the expressed emotions based on acoustic parameters extracted from the recordings of the singers, and compare the results with human judgment.

Modeling the Emotion Communication Process in Singing

Much of the research on emotion recognition from vocal material has been mainly interested in the issue of accuracy, with little concern for the nature of the underlying inference mechanisms. However, it seems quite evident that better understanding the mechanisms will be a major asset for studies on emotion communication. Scherer (1986, 2003) has suggested using Brunswik's (1956) lens model of perception as the guiding paradigm for work on the vocal communication of emotion because it considers both the production/encoding and the perception/decoding aspects of the process. Juslin and his collaborators (Juslin, 2000; Juslin & Laukka, 2001) were among the first to use this model to study cue utilization in emotion communication in music performances and for the encoding and decoding of vocal emotions. In an early study on the expression and perception of personality in the speaking voice, Scherer (1978) proposed and tested an extension of the lens model in which the cue domain is separated into (a) distal, objectively measurable cues (such as acoustic voice parameters for the speaker) and (b) subjective, proximal percepts of these cues (such as voice quality impressions formed by the listener). The major justification for this extension is that in perception and communication, the objectively measurable cues in vocal behavior are subject to a transmission process from sender to receiver (which often adds noise) and need to be processed and adequately transformed by the sensorium of the receiver.

More recently, Scherer (2013b) formalized the earlier suggestion for an extension of the lens model as a tripartite emotion expression and perception (TEEP) model (Figure 1). Applying this model to our specific research question, the internal state of the singer (e.g., the intended emotion interpretation) is (a) encoded via distal vocal cues (as recorded and analyzed by acoustic analysis), (b) the listener perceives the singing sample and extracts a number of proximal cues (which can be independently measured by subjective voice quality ratings obtained from a separate group of naive observers), and, finally, (c) some of these proximal cues are used by the listener (e.g., in a recognition test as used here) to infer the emotional interpretation of the singer with the help of schematic recognition or explicit inference rules. The first step in this process describes the externalization of the emotional interpretation, the second step the transmission of the acoustic information and the forming of a perceptual representation of the physical
Figure 1. Graphic illustration of the tripartite emotion expression and perception (TEEP) model.
speech/voice signal, and the third and last step the inferential utilization and emergence of an emotion attribution.

In this article, we report the results of this work on perception/recognition of emotion in the singing voice. The major questions examined are as follows:

(1) Are listeners able to recognize the intended expressive target with better-than-chance accuracy? Are there individual differences between listeners? Does culture or language affect the ability to recognize the emotional interpretations?

(2) Are some emotions more easily recognized than others? How do the recognition results compare with accuracy rates and confusion patterns in studies using speaking voices? How does the judgment performance compare with results of statistical classification based on acoustic parameters?

(3) What is the nature of the process involved in inferring emotions from a singing sample? What are the major acoustic variables involved, and how do perceived configurations of voice quality mediate the inferences?

Method

Recording of the Singing Corpus

Singers. We recorded eight professional opera singers, seven of whom regularly interpret major roles in leading international opera houses (one singer having completed her training at the conservatory and starting her career): two sopranos, two mezzo-sopranos, two tenors, one countertenor, and one bass-baritone. The emphasis on excellence and professionalism brings about a number of limitations, the most important one being the relatively small number of stimuli on which the study is based, limiting the statistical power and the use of multivariate statistical analysis techniques.

Design and vocal production material. We chose a subset of emotion categories regularly used in recognition studies with a speaking voice that seemed appropriate to be interpreted in the context of an opera libretto, namely, anger, despair, fear, joy, love, pride, sadness, serenity, and tenderness. The vocal materials were a standard sentence consisting of meaningless syllables ("ne kal ibam soud molen") used to construct a large corpus of emotion expression in speech (Geneva Multimodal Emotion Portrayals; Bänziger, Mortillaro, & Scherer, 2012) and the sustained vowel [ə] (the schwa sound). We asked the singers to sing both types of material on the standard musical scale (both upward and downward) and to imagine that they wanted to project the respective emotional tonality in their interpretation of a lyrical work.

Recording of the singing samples. Singers were recorded in individual sessions at the Conservatoire de Musique de Genève and in a studio of the Swiss Radio Corporation in Geneva, using professional audio equipment operated by a sound engineer. Details of the equipment and the settings of the recordings are described in Section A of the supplemental material.

Acoustic measurement. We used the recently developed Geneva Minimalistic Acoustic Parameter Set and its dedicated software (Eyben et al., 2015) for the automatic extraction of an extended parameter set from the recordings in the singing corpus described above. Details of the acoustic analyses are provided in Scherer et al. (in press), describing the production aspect of the study, and in the supplemental material to this article. In the present work, we refer to the data reported in that article, in particular six major acoustic scales based on principal component analyses: Loudness (different indicators of high vocal intensity or energy), Dynamics (tempo, mean perturbation, and steep rise/fall slopes for F0 and loudness), Perturbation Variation (variability of jitter, shimmer, and harmonic-to-noise ratio as measured by the coefficient of variation), Low Frequency Energy (proportion of energy in lower ranges of the spectral distribution compared to higher regions), Formant Amplitude (loudness variation with low formant amplitude), and F0 (level and range of the fundamental frequency).
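To make the derivation of such component scales concrete, the following minimal sketch (not the authors' actual analysis) shows how a table of per-recording acoustic functionals could be standardized and reduced to six principal components with scikit-learn. The file name, column contents, and component labels are hypothetical; the real parameter definitions and loadings are those reported in Scherer et al. (in press).

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical input: one row per sung recording, one column per acoustic
# functional (loudness, F0, jitter, shimmer, harmonic-to-noise ratio,
# spectral balance, ...), e.g., as exported by a GeMAPS-style extraction.
features = pd.read_csv("singing_acoustic_functionals.csv",
                       index_col="recording_id")

# Standardize the parameters, then extract six components as a rough
# analogue of the Loudness, Dynamics, Perturbation Variation, Low Frequency
# Energy, Formant Amplitude, and F0 scales described above.
z = StandardScaler().fit_transform(features)
pca = PCA(n_components=6)
scores = pd.DataFrame(pca.fit_transform(z), index=features.index,
                      columns=[f"component_{i + 1}" for i in range(6)])

# The loadings indicate which acoustic parameters each component summarizes
# and can be used to label the components.
loadings = pd.DataFrame(pca.components_.T, index=features.columns,
                        columns=scores.columns)
print(loadings.round(2))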
Conducting the Recognition Test

Participants. We recruited the following five samples of listener judges for the recognition study:

(1) An international sample of volunteer participants responding to the possibility of participating in a web emotion recognition study advertised on the Swiss Center for Affective Sciences Online experimentation website (with additional announcements during World Voice Day, 2014, and via social networks), N = 199, 39.7% students, 69.3% women, 37.2% over 40 years, no remuneration.

(2) A U.S. survey conducted via the Qualtrics Panel survey division (Qualtrics, Provo, Utah), N = 266, 39.1% students, 71.1% women, 33.1% over 40 years, remunerated via gift vouchers.

(3) A sample recruited by posted announcements at the University of Geneva, N = 26, 73.1% students, 61.5% women, 7.7% over 40 years, remunerated on an hourly basis.

(4) A sample recruited by posted announcements at the University of Otago, New Zealand, consisting of two subgroups: (a) persons of European descent, native English speakers, N = 33, 90.9% students, 60.6% women, 0% over 40 years; and (b) persons of Maori descent, native English speakers, N = 13, 61.5% students, 61.5% women, 7.7% over 40 years, remunerated on an hourly basis.

(5) A sample recruited by posted announcements at the National University of Singapore, consisting of two subgroups: (a) Singaporeans of Chinese descent, habitually speaking English and responding to the test in this language, N = 37, 94.6% students, 5.4% women, 2.7% over 40 years; and (b) Singaporeans of Chinese descent, habitually speaking Chinese and responding to the test in this language, N = 37, 97.3% students, 2.7% women, 0% over 40 years, both remunerated on an hourly basis.
To accommodate such a large number of judges, we had to conduct the judgment study in the form of a web application. Although the findings are expected to be very stable, given the large N, there are limitations with respect to the comparability of the conditions under which participants listened to the stimuli. In addition, there are limitations related to the scarcity of personal information; because of the need for complete anonymity, we asked only for age, gender, and whether the person was a student or not (and for some student samples, nationality and language). Despite these limitations, the data reported under Results allow us to discuss preliminary answers to the questions posed at the end of the introduction.

Design of the recognition test. We decided to use only the nonsense-syllable sentence for the recognition test, as it seems closer to the kind of singing listeners are used to than the simple scale using schwa [ə] sounds. Each emotion was to be represented by vocalizations from each of the singers. We decided not to select the stimuli to be included in the testing set on the basis of the presumed quality of the productions by the different singers. The reason is that emotional interpretation in singing is determined by the intuition and intention of the individual singer, which are subjective by definition. In addition, there is no body of expertise that could serve as objective criteria for an expert panel to make such a selection. Singers were given the opportunity to repeat the recordings of individual emotions until they were satisfied with their interpretation. However, there is undoubtedly some variation in the expressivity of the different singing samples, especially as the singers sometimes mentioned that they found it easier to portray certain emotions than others.

As the first three singers recorded did not interpret three of the emotions on the final list, the total number of items for the test was 63. Because of the large number of stimuli, we divided them into two lists, and participants received one of the two lists. Three stimuli were included in both List 1 and List 2 to determine potential group differences. The list presented to each participant was counterbalanced every time a new participant clicked on the link to start the survey.
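As an illustration of this assignment scheme, a schematic sketch of the list counterbalancing and per-participant randomization is given below. This is not the actual web application code; the identifiers are invented, and only the alternation between lists and the overlap of three shared stimuli follow the description above.

import itertools
import random

# Alternate successive participants between the two stimulus lists
# ("counterbalanced every time a new participant clicked on the link").
list_cycle = itertools.cycle(["list_1", "list_2"])

def start_session(stimulus_lists):
    """Assign the next participant to a list and shuffle its stimuli."""
    assigned = next(list_cycle)
    trial_order = stimulus_lists[assigned][:]  # copy before shuffling
    random.shuffle(trial_order)                # random presentation order
    return assigned, trial_order

# Hypothetical stimulus identifiers: 33 items per list, with items 31-33
# appearing in both lists to check the equivalence of the two subgroups.
stimulus_lists = {
    "list_1": [f"item_{i:02d}" for i in range(1, 34)],
    "list_2": [f"item_{i:02d}" for i in range(31, 64)],
}
print(start_session(stimulus_lists))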
Procedure. Participants were provided with a link to a server that automatically started the recognition test described above in their own language. Participants in the web volunteer sample could choose their preferred language (English, French, or German), but in all other cases, the language was imposed. Participants were free to complete the test from their home or the university, and they could choose to use either headphones or speakers to listen to the audio stimuli. Before starting the task, participants were instructed that they would hear eight internationally renowned opera singers expressing nine different emotions through meaningless sound sequences and that they would choose, for each vocalization, the emotion word that they considered closest to what the singer intended to express. The buttons for the nine response options (anger, despair, fear, joy, love, pride, sadness, serenity, and tenderness) were arranged in three columns of three response buttons, and participants had to click on the appropriate button. The presentation order of the labels on the response buttons remained the same throughout the experiment. If participants were uncertain about their response, they had the option to replay the stimulus. After adjusting the audio settings to comfortable listening conditions, participants were presented with three example stimuli to familiarize themselves with the task. After the three example trials, they listened to one of the two lists with 33 audio stimuli presented in random order and chose what they considered the most appropriate emotion label by clicking on the corresponding button. At the end of the test, participants received feedback with their overall score for the task.

Voice Quality Ratings

Ratings for the proximal voice cues of the opera singers' vocalizations were collected from an additional group of participants in order to investigate the effects of voice quality on emotion perception.

Participants. Nineteen individuals (63% females, 5.3% over 40 years) were recruited through advertisements at the University of Geneva to participate in exchange for a small remuneration.

Procedure. The ratings were collected in a single session in a classroom at the University of Geneva, where all participants were invited to come at the same time. Audio recordings were played on a laptop computer connected to the classroom loudspeakers, and the volume was adjusted to a comfortable level. Each participant was given eight (one per singer) paper copies of the French version of the Geneva Voice Perception Scales (GVPS; Bänziger, Patel, & Scherer, 2014). The GVPS consists of eight linear and continuous two-dimensional scales ranging from one end to the other, representing eight different characteristics of the voice: pitch, loudness, modulation of intonation, speech rate, articulation, (in)stability, roughness, and sharpness. After receiving instructions about the experimental task, participants were presented with 16 example audio recordings illustrating each end of the eight scales (low pitch, high pitch, low volume, high volume, etc.) to ensure that they understood all the labels on the scales. The example stimuli consisted of a male voice uttering the sentence "I cannot believe it" (in French: "Je ne peux pas le croire"), emphasizing both extremes of each characteristic. After hearing the example stimuli, participants were presented with the recordings of the opera singers, which they had to rate using the GVPS. For each recording, participants were asked to mark the position of the voice on all eight scales. Each recording was replayed repeatedly while it was rated so that raters could base their judgment on continuous exposure to the respective voice. When all participants had finished, the next recording was presented in the same fashion. The recordings were presented one singer after the other; that is, we played all the recordings for one singer before moving on to the next singer, and the order of presentation was the same for each singer. We did not randomize the order of singers and of recordings because (1) the vocal range (F0 range, tessitura) of the different voice types (from countertenor to bass-baritone) is extremely different and, as F0 strongly influences voice quality judgment, we preferred listeners to focus on the differences within a certain range and avoid sudden jumps between stimuli; and (2) randomization of stimulus presentation introduces random effects that will even out for a large number of stimuli but may adversely affect judgments of a restrained number of stimuli. Participants provided 63 × 8 = 504 ratings. After the experimental session, ratings were transferred from the analog scales on the paper questionnaires to numbers, by dividing the scales into five segments of equal length. Thus, each analog rating received a score ranging from 1 (left end of the scale) to 5 (right end of the scale). Interrater reliability was calculated by
using Cronbach's α coefficient, showing very high reliability of the ratings (Cronbach's α = .926 for the whole sample).
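The scoring of the analog marks and the reliability check can be expressed in a few lines of code. The sketch below is only an illustration (the rating matrix and scale length are hypothetical); the five-segment conversion and the standard Cronbach's α formula follow the description above.

import numpy as np

def analog_to_score(position, scale_length=1.0):
    """Map an analog mark (distance from the left end of the scale) to a
    1-5 score by dividing the scale into five segments of equal length."""
    segment = scale_length / 5.0
    score = int(position // segment) + 1
    return min(score, 5)  # a mark at the right end falls into segment 5

def cronbach_alpha(scores):
    """Cronbach's alpha for an (n_recordings x n_raters) score matrix,
    treating raters as items to index interrater consistency."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                      # number of raters
    rater_vars = scores.var(axis=0, ddof=1)  # variance per rater
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - rater_vars.sum() / total_var)

# Hypothetical example: six recordings rated by four raters on one scale.
marks = [[0.12, 0.18, 0.15, 0.11], [0.82, 0.91, 0.86, 0.79],
         [0.43, 0.52, 0.47, 0.50], [0.63, 0.72, 0.66, 0.58],
         [0.22, 0.31, 0.27, 0.24], [0.92, 0.95, 0.88, 0.85]]
scores = [[analog_to_score(m) for m in row] for row in marks]
print(cronbach_alpha(scores))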
Results

Data Analysis

All participants who finished the test were included in the analysis. One-way analyses of variance (ANOVAs) for each of the three stimuli presented in both lists revealed no significant differences, suggesting that the two subgroups who heard different stimulus sets can be combined in the same analyses.

Recognition Accuracy

The results obtained for recognition accuracy are shown in Table 1 in the form of percentages of correct recognition for each of the nine emotions for each group of participants and the total for all emotions combined. The mean accuracy was 33.1%, with a minimum of 7%, a maximum of 64%, and a standard deviation of 9.1%. These raw percentages have to be interpreted with caution, as potential judges' response biases (that is, choosing certain response categories more often than others) can affect the chance level (because overuse of a certain category will make hits more likely). We therefore corrected the theoretical chance level (11.1%) for over- or underuse of judgment categories, separately for each emotion category (listed in the last row of Table 1). The table shows that many of the accuracy percentages exceed the respective chance levels by a factor of two or three (and sometimes even more). However, there are major differences between the emotions and the different subgroups. Although participants in all samples and subgroups recognized anger, fear, joy, pride, sadness, and tenderness with percentages well above the corrected chance levels, this was not the case for contempt, love, and serenity.
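The hit rates and the bias-corrected chance levels reported in Table 1 can be obtained directly from the raw judgment data. The sketch below is a minimal illustration with invented data; it assumes a long-format table with one row per judgment and assumes that the correction simply equals the proportion of all responses given to each category (the hit rate expected from a judge who guesses with that response bias).

import pandas as pd

# Hypothetical long-format judgment data: one row per listener x stimulus.
judgments = pd.DataFrame({
    "target": ["anger", "anger", "fear", "fear", "sadness", "sadness"],
    "choice": ["anger", "pride", "fear", "sadness", "sadness", "tenderness"],
})

# Hit rate per emotion: proportion of trials with that target that received
# the correct label.
hit_rate = (judgments.assign(hit=judgments["target"] == judgments["choice"])
                     .groupby("target")["hit"].mean())

# Bias-corrected chance level per emotion: how often each category was
# chosen overall, regardless of the target (overused categories thus get a
# higher chance level, underused ones a lower one).
corrected_chance = judgments["choice"].value_counts(normalize=True)

print(pd.concat([hit_rate, corrected_chance], axis=1,
                keys=["hit_rate", "corrected_chance"]))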
We analyzed the significance of the observable differences between the samples and subgroups, as well as the potential role of student status, age, and gender on the average accuracy over all emotions. We performed an 8 × 2 × 4 × 2 univariate ANOVA (Group × Gender × Age × Student status) for the total correct score, which yielded two significant main effects and one significant interaction effect (see Table S1 in the supplemental material). Group differences, F(7, 550) = 4.22, p < .001, η² = .044, were examined with the post hoc Tukey B homogeneous subgroups test, which yielded three largely overlapping subsets, with the NZ Maori group at the lower end and the U.S. panel and the web volunteers at the high end (Table S2a). Age differences, F(3, 550) = 4.05, p = .007, η² = .022, showed the oldest group (older than 60 years) to have significantly lower accuracy levels based on the post hoc test (Table S2b; but note that the effect size is low, and only 4.1% of the participants are in this class). A significant but weak Gender × Student status interaction effect emerged, F(1, 550) = 6.07, p = .014, η² = .011, showing female students to be more accurate judges than their male counterparts, whereas there was little difference for nonstudent participants. A multivariate ANOVA for an 8 × 2 × 4 × 2 (Group × Gender × Age × Student status) comparison did not show sizable significant differences (see Table S3 in the supplemental material).
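A factorial ANOVA of this kind can be specified compactly in Python. The following sketch is schematic only (hypothetical file and column names, type II sums of squares as one common convention) and is not intended to reproduce the exact analyses behind Tables S1-S3.

import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Hypothetical per-participant data: total accuracy plus the four factors
# of the 8 x 2 x 4 x 2 design (group, gender, age band, student status).
df = pd.read_csv("participant_accuracy.csv")

model = smf.ols(
    "accuracy ~ C(group) * C(gender) * C(age_band) * C(student)",
    data=df,
).fit()
print(anova_lm(model, typ=2))  # F tests for main effects and interactions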
We also computed recognition rates by singers. The average accuracy for the least well-recognized of the eight singers was 21%, whereas the average for the best-recognized singer reached 57%. The mean value across all eight singers was 34%, with a standard deviation of 12%. The potential causes of these differences are difficult to determine: they could be due to idiosyncratic interpretations of different emotions, a greater affinity with certain emotion expressions, a restricted range of expressivity, or even limitations due to the nature of a singer's voice type or tessitura. Another potential factor is listeners' preconceptions concerning the affective nature of certain voice types.

To better understand the differences in the recognition rates for the different emotions, we computed the confusion matrix for the percentage of items in each target class (correct answer) judged to belong to a specific emotion category (choice), as shown in Table 2. The matrix shows two major confusion clusters: (1) anger, pride, and contempt; and (2) sadness, tenderness, and serenity. Thus, the low accuracy rate for contempt is probably due to anger and pride being better known and more frequently encountered emotions. Similarly, serenity (or being calm) is rarely seen as an emotional state and was thus more readily called sadness or tenderness. Cluster 1 is probably based on the vocal expression of high power, whereas the central factor for Cluster 2 is most likely low power and low arousal. The fact that each of the confusion clusters includes both negative and positive emotions confirms earlier claims that the voice is less suited to communicate valence than it is to communicate power and arousal (see Scherer et al., 2011, p. 415). In sum, anger and fear were best recognized, with hit rates that largely exceeded chance expectancy, followed by pride and sadness. Joy and tenderness were still recognized to some extent.
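A confusion matrix in the format of Table 2 is simply a cross-tabulation of intended against chosen emotion, with each row expressed as a percentage of the judgments for that target. A minimal sketch with invented data:

import pandas as pd

# Hypothetical long-format judgment data (one row per listener x stimulus).
judgments = pd.DataFrame({
    "target": ["anger", "anger", "sadness", "sadness", "sadness", "fear"],
    "choice": ["anger", "pride", "sadness", "tenderness", "sadness", "fear"],
})

# Row-normalized confusion matrix: rows are the singers' intended emotions,
# columns the listeners' choices; the diagonal entries are the hit rates.
confusion = (pd.crosstab(judgments["target"], judgments["choice"],
                         normalize="index") * 100).round(1)
print(confusion)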

Table 1
Accuracy of Judgement in Percentage by Participant Groups and Emotions (Theoretical Chance Level = 11.1%)

Group Anger Contempt Fear Joy Love Pride Sadness Serenity Tenderness Total

Web volunteers (EN, FR, GE) 49.3 20.9 44.8 34.6 13.1 41.4 39.8 21.6 28.7 34.1
U.S. Survey EN 46.9 21.5 44.5 31.7 15.6 42.0 39.0 27.9 28.2 34.8
Uni Geneva FR 44.4 22.2 37.3 23.0 10.4 30.8 39.6 22.2 31.3 30.0
Uni NZ Otago European EN 42.9 14.1 33.4 26.4 11.1 34.3 44.1 17.6 20.2 29.3
Uni NZ Otago Maori EN 27.7 2.5 35.1 36.6 12.8 27.3 34.6 22.5 19.5 25.7
Uni Singapore Chinese EN 49.5 17.5 30.1 17.4 13.0 28.8 32.4 26.1 18.0 27.3
Uni Singapore Chinese CH 47.0 14.2 36.4 22.0 18.4 33.7 38.8 22.5 23.9 29.5
Chance level corrected for response bias 9.4 10.4 8.4 10.0 9.5 14.2 14.9 11.6 11.5
Note. Uni = recruitment via university postings; ethnic origin: European = New Zealanders of European descent, Maori = New Zealanders of Maori descent, Chinese = Singaporeans of Chinese descent; languages: CH = Chinese; EN = English; FR = French; GE = German.

Table 2
Confusion Matrix of Emotion Judgements vs. Correct Answers for All Groups of Judges Combined (in %), Correct Targets Compared
With the Recognition Levels for Actor Emotion Portrayals in Nonsense Sentences

Correct answers (singer intention)


Choices Anger Contempt Fear Joy Love Pride Sadness Serenity Tenderness

Anger 48.6 (59) 12.9 4.4 5.4 2.7 8.8 0.3 0.5 0.5
Contempt 21.8 19.0 (24) 5.7 10.7 10.1 15.0 2.5 4.3 4.3
Fear 4.3 2.8 43.2 (59) 5.5 4.6 3.9 6.8 2.0 2.8
Joy 7.4 9.0 8.2 30.1 (28) 10.0 15.3 2.3 4.1 4.1
Love 0.8 8.2 5.7 6.2 13.8 () 6.7 10.6 15.3 18.5
Pride 14.3 28.3 4.2 19.5 14.8 37.3 (16) 1.5 4.6 3.6
Sadness 1.3 3.5 16.7 11.9 16.4 4.9 39.2 (18) 21.7 18.5
Serenity 1.5 10.0 5.3 5.3 16.3 5.1 15.8 24.4 () 21.1
Tenderness 0.1 6.4 6.5 5.4 11.3 2.9 21.0 23.1 26.6 (23)
Chance level corrected for response bias 9.4 10.4 8.4 10.0 9.5 14.2 14.9 11.6 11.5
Note. Average accuracy of singer judgments across emotions: 33.1%. (xx) = accuracy percentages achieved for the vocal channel of actor speech portrayals of emotions (from column A under Core set in Table 5 in Bänziger et al., 2012), average accuracy = 32.4%. Numbers in italics: ≥ 10%. Boldface indicates the percentage of correct judgments (hit rate).

In contrast, the hit rates for contempt, love, and serenity were close to chance, largely because of frequent confusions with other emotions. This impression was confirmed by an ANOVA, F(8, 62) = 2.575, p = .019, η² = .276, with post hoc analysis for emotion differences that placed these three emotions together at the very low end of Subset 1 (i.e., maximally different from anger at the high end of Subset 2; see Table S4 in the supplemental material).

How does this compare with the accuracy percentages found for actors' portrayals of emotion in the speaking voice? Scherer et al. (2011) reviewed the empirical research findings on emotion decoding from vocal expression in speech and reported the following mean accuracies for the major emotions: anger 74.9%, fear 62.4%, joy/happiness 54.0%, and sadness 74.9%. However, these accuracy percentages are not directly comparable with the data presented here, as in many studies in the literature only a few maximally different basic emotions have been studied, which reduces the probability of confusions between emotions with similar valence or arousal levels and encourages the use of simple exclusion rules rather than direct recognition. Bänziger et al. (2012), who used the core set (about 150 actor expressions for 18 emotions) of the Geneva Multimodal Emotion Portrayals corpus to conduct a recognition study, have reported results that are more comparable to the present data. Comparability is even more assured by the fact that the same nonsense-syllable utterance was used in both studies, either spoken or sung. The results of the ratings of the audio channel of these clips only are also shown in Table 2 (shown in parentheses next to the hit rates). The respective values are not directly comparable, as the raters in the Bänziger et al. (2012) study had to choose among 18 alternatives, which reduces the chance level by half and is likely to increase confusions. On the other hand, the clips in the core set had been chosen on the basis of emotion ratings of a much larger (1,000 items) corpus of expressions, a preselection that should reduce the number of confusions and increase the likelihood of correct recognition. The comparison of the values in the diagonal of the confusion matrix in Table 2 suggests that judges were better able to recognize anger and fear from the speaking compared with the singing voice, whereas sadness and pride were better recognized in the singing voice (however, as these two response categories were used more frequently than others, the chance levels were corrected upward). Joy and tenderness attained a similar level.

Given the low recognition rate for serenity, love, and contempt, in the remainder of this article we focus on the six more established emotions with more satisfactory emotion communication potential in the singing voice. Recomputing the average recognition rate using only the accuracy percentages for the six best emotions yields an accuracy percentage of 37.0%. It is interesting to compare this accuracy percentage for human judges with the accuracy of an automatic emotion classification/recognition based on acoustic parameters. As described in the Method section, the vocal stimuli discussed here have been extensively analyzed with the Geneva Minimalistic Acoustic Parameter Set acoustic parameter extraction tool (Eyben et al., 2016). The results are reported in a separate paper (Scherer et al., in press) in which a multiple discriminant analysis with five major acoustic factors has been computed. The resulting confusion matrix of the classification is reproduced in Table 3 to allow comparison with the human recognition patterns reported above. The overall accuracy of the automatic classification is 43.5%, compared with 37.0% found for the human judges. A close inspection of the confusion matrix in Table 3 shows that this difference is largely due to the relatively lower accuracy percentages in the human judgments for sadness and tenderness on the one hand and pride on the other. The reason for this is probably that the human judges had nine categories to choose from and the discriminant analysis only six. Given that in the original judgment study there were strong confusions of both sadness and tenderness with serenity and love and of pride with contempt (see Table 2), the overall accuracy is likely to be higher for the human judges than for the machine classification. This assumption is confirmed by the fact that some of the machine confusions are much larger and more inappropriate than the corresponding human confusions, for example, in the case of joy being classified as anger or the mutual confusions of tenderness and fear.
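As a rough illustration of what such an acoustic classification involves, the sketch below runs a linear discriminant analysis with leave-one-out cross-validation over a hypothetical table of acoustic component scores. It is not the discriminant analysis reported in Scherer et al. (in press); the file and column names are invented, and only the general logic (a few predictors, a small sample, a cross-validated confusion matrix) mirrors the approach described above.

import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Hypothetical input: one row per sung stimulus, acoustic component scores
# as predictors, and the singer's intended emotion as the label.
data = pd.read_csv("singing_acoustic_scores.csv")
X = data[["loudness", "dynamics", "perturbation_variation",
          "low_frequency_energy", "formant_amplitude", "f0"]]
y = data["emotion"]

# Leave-one-out cross-validation is a reasonable choice for a small corpus
# (48 samples covering the six main emotions).
predicted = cross_val_predict(LinearDiscriminantAnalysis(), X, y,
                              cv=LeaveOneOut())

labels = sorted(y.unique())
print(accuracy_score(y, predicted))
print(pd.DataFrame(confusion_matrix(y, predicted, labels=labels),
                   index=labels, columns=labels))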
Modeling the Recognition Mechanism

Bänziger, Hosoya, and Scherer (2015) successfully used the TEEP model to analyze the complete process of the encoding of
Table 3
Confusion Matrix (% Accuracy) for Machine Classification of Singers' Emotion Portrayals in Comparison With Human Judgment

Actual emotion (singer intention)

Predicted/judged emotion Anger Fear Joy Pride Sadness Tenderness

Anger      AC 46.7 12.5 43.8 12.5 0 0
           HJ 48.6 4.5 5.4 8.9 .3 .5
Fear       AC 13.3 43.8 12.5 0 6.3 18.8
           HJ 4.3 43.2 5.5 3.9 6.8 3
Joy        AC 33.3 12.5 25 31.3 6.3 0
           HJ 7.4 8.2 30.1 15.3 2.3 4.1
Pride      AC 6.7 0 18.8 56.3 0 0
           HJ 14.4 4.3 19.6 37.3b 1.5 3.7
Sadness    AC 0 12.5 0 0 56.3 37.5
           HJ 1.3 16.7 11.8 4.9 39.2a 18.5a
Tenderness AC 0 18.8 0 0 31.3 43.8
           HJ .1 6.4 5.4 3 21 26.6

Note. AC = automatic machine classification; HJ = human judgment. Boldface indicates the percentage of correct judgments (hit rates).
a Lower because of confusions with serenity and love. b Lower because of confusion with contempt; bold numbers refer to the cases discussed in the text.

emotions by actors to the decoding by naive raters (judging the intended emotions), using structural equation modeling to parametrize the model and test the goodness of fit. As structural equation modeling (and to some extent also hierarchical regression procedures) requires a large number of observations to yield reliable results, we could not use these techniques, as we had only a relatively small number of observations (48 singing samples, restricting the analysis to the six major emotions). In consequence, we use descriptive graphs with Pearson correlations between the elements of the model to visualize potential models and develop hypotheses for future research. Figures 2 and 3 illustrate this procedure for anger and sadness.

For the material reported below, we adopted dichotomous variables for the expressed emotions (e.g., anger expressed = 1 and other emotions expressed = 0) to allow correlational analysis. Figure 2 shows the TEEP model for anger, which is, as shown in Table 1, the best recognized of the singers' emotion interpretations and surpasses the accuracy level achieved by multiple discriminant analysis (see Table 3). The correlation between anger expressed and anger inferred is r = .86 (p < .01). The TEEP model illustrates the mechanism that produced this excellent result: On the distal side, the acoustic measurements of high loudness, high dynamics (rate, F0 contour, loudness variation), and weak low frequency energy correlate with singer portrayals of anger. These acoustic characteristics are correctly perceived by human judges (as high volume, variable intonation, rapid rate/tempo, and low vocal instability), indicating that the appropriate cues for the inference of the underlying emotion are available. We assume that the ratings of instability reflect the difference between adjacent F0 periods, regardless of whether it is caused by random variation or nonrandom variation such as vibrato (at least in classical singing, random variation would mostly be interpreted as a sign of poor vocal

Figure 2. Tripartite emotion expression and perception model of anger inference from the singing voice.

Figure 3. Tripartite emotion expression and perception model of sadness inference from the singing voice.

control). On the proximal side, the use of these cues by the judges inferring the emotions interpreted by the singers corresponds exactly to the relationships on the distal side. In other words, anger portrayals have clear acoustic concomitants, the respective acoustic parameters are correctly perceived, and the inference rules mirror the distal expression patterns. However, this appropriate configuration is not sufficient to obtain a recognition rate higher than 50% because of the confusions with contempt and pride. Although the confusion with contempt may be explained by the fact that contempt is sometimes blended with anger, the confusion with pride is clearly due to the lack of acoustic parameters in the model that capture the central difference in positive and negative valence between anger and pride. It seems that vocal expression lacks powerful cues for valence discrimination (especially in comparison with facial expression, where smiling [zygomaticus activity], with or without accompanying speech activity, is a powerful and ubiquitous signal for positive valence; Matsumoto, Keltner, Shiota, Frank, & O'Sullivan, 2008; Scherer et al., 2011). Because the 0/1 coding of anger expression (other emotion/anger) as a dichotomous variable correlates at r = .86 with the percentage of judges having chosen anger to label the respective interpretations, at least some weak valence cues may be available but are not measured by our current parameter set.
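Before turning to sadness and joy, it may help to note that such a descriptive path analysis reduces to Pearson correlations among four blocks of variables: the dichotomous expression code, the distal acoustic scales, the proximal voice quality ratings, and the proportion of judges inferring the emotion. The sketch below illustrates the computation with hypothetical column names; the coefficients in Figures 2 to 4 come from the authors' data, not from this code.

import pandas as pd

# Hypothetical per-stimulus table (one row per singing sample) combining the
# levels of the TEEP model; all column names are invented:
#   anger_expressed      0/1 dummy code of the singer's intention
#   loudness, dynamics, low_frequency_energy   distal acoustic scales
#   rated_volume, rated_intonation, rated_rate proximal GVPS means
#   anger_inferred       proportion of judges choosing "anger"
stimuli = pd.read_csv("teep_anger_variables.csv")

corr = stimuli.corr(method="pearson")

# Correlations that parametrize the arrows of the model:
print(corr.loc["anger_expressed",
               ["loudness", "dynamics", "low_frequency_energy"]])  # expression -> distal cues
print(corr.loc[["loudness", "dynamics"],
               ["rated_volume", "rated_intonation"]])              # distal -> proximal percepts
print(corr.loc[["rated_volume", "rated_intonation", "rated_rate"],
               "anger_inferred"])                                  # proximal -> inference
print(corr.loc["anger_expressed", "anger_inferred"])               # overall association (r = .86 in Figure 2)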
The case of sadness inference, illustrated by the TEEP model shown in Figure 3, is somewhat different. The correlation between sadness expressed and sadness inferred is r = .58 (p < .01). Here the frequent confusions with tenderness, serenity/calm, and love (shown in Table 2) seem to be due to a major discrepancy between the distal expression and the proximal inference: Judges use cues of a low level of dynamics as a general cue for sadness, tenderness, serenity/calm, and love. However, on the distal side, the correlation between sadness expressed and the acoustic dynamics component, while indeed negative, is rather low and nonsignificant. Thus, although judges correctly perceive low loudness and slow rate/tempo in sadness portrayals, they may overgeneralize the covariation of these two parameters with low dynamics (as is suggested by the high correlations between acoustic dynamics and judged volume, intonation, and rate in the transition part of the model). Figure 4 illustrates the case of joy. The correlation between joy expressed and joy inferred is r = .58 (p < .01). Here, the lower recognition rate is probably due to judges using the proximal cue of variable intonation (produced by the dynamics factor) to infer joy. However, there is no strong distal relationship between joy as expressed by the singers and a high level of dynamics.

The discussion of Figures 2 to 4 was meant to provide an example of how, with the help of the TEEP model, one might

Figure 4. Tripartite emotion expression and perception model of joy inference from the singing voice.
arrive at a better understanding of the process of vocal communication of emotion in singing, as well as the probability of successfully conveying an emotional interpretation of a character in a particular situation to the listener.

Discussion and Conclusion

We have reported what we believe to be the first systematic empirical investigation of the extent to which nonprofessional listeners can recognize emotional interpretations of musical material by singers. In designing the study, we made the choice to use highly professional opera singers to ensure a sufficient degree of expertise and experience in encoding a wide variety of emotional expressions, recorded under studio conditions, to obtain stimulus material of high quality and realism for acoustic analyses and judgment studies. We were able to obtain the full collaboration of eight professional opera singers, the majority of whom regularly perform on major opera and concert stages in the world. Over a period of 4 years, we approached major opera singers who were performing in Geneva to ask whether they could afford to spend an afternoon recording the emotional portrayals in a studio.

With respect to the listener judges, we chose to recruit a large number of listeners from different cultures, speaking different languages and with different musical preferences and degrees of musical knowledge, to obtain a representative sampling of the ability to recognize emotions in the singing voice. We obtained a large representative sample of listeners in addition to student groups in different countries.

The results provide clear answers to the three major questions posed in the introduction:

(1) Are listeners able to recognize the intended expressive target with better-than-chance accuracy? Does culture or language affect the ability to recognize the emotional interpretations? Are some emotions more easily recognized than others?

Despite the high variability in listener characteristics and listening situations, we found an accuracy percentage (33%) that is three times higher than chance expectancy (11%), suggesting that untrained listeners are indeed able to recognize a reasonable number of emotion expressions in singing voices. As expected, major individual differences in recognition ability were observed, ranging from a minimum of 7% to a maximum of 64% with a standard deviation of 9.1%. Interestingly, there were few statistically significant differences for gender, age, or student status. Future work needs to examine the issue of individual differences in detail, adding important background variables such as musical preferences, music knowledge, and training. We also found a remarkable stability of the accuracy rates across languages and cultures (at least for the small number of countries we sampled, the majority of the participants having performed the task anonymously). Although there are some differences, they cannot be reliably assigned to language or culture differences, given that the percentages of students in the respective samples are not comparable. This is an issue that will need to be explored more systematically in further research, if possible taking the relative importance of singing in the respective cultures into consideration.

As to differences in the recognizability of different emotions, the data show that the classic basic emotions are also best recognized in the singing voice. Pride, which is rarely studied, also achieves a remarkable accuracy score. In contrast, contempt, love, and serenity, which are not infrequently encountered in operas, are relatively close to chance level. As to contempt, this emotion is also frequently confused with anger in spoken expressions. As to love and serenity, the vocal expressions of these emotions are close to those of sadness and tenderness (low vocal energy and slow tempo), and thus frequent confusions occur with the latter two emotions. It seems likely that sadness and tenderness are penalized by these confusions and would probably obtain better accuracy if love and serenity were not provided as potential alternatives.

(2) How does the recognition ability compare with accuracy rates in studies using speaking voices? How does the judgment performance compare with results of statistical classification based on acoustic parameters?

Given that singers were asked to encode the different emotions as they would in a stage performance, it might seem reasonable to assume that the acted emotions should be recognized with much better-than-chance accuracy. However, as the design of the study required the singers to sing a sentence-like sequence of nonsense syllables with the intervals of the musical scale, they did not have all of the usual features of a vocal performance available for their interpretation, especially pitch height and variable melody (intonation). In consequence, one would expect lower recognition accuracy compared with natural performances and with the recognition of emotions in the speaking voice. The latter was indeed true for anger and fear (which are often characterized by important changes in fundamental frequency, including dramatic intonation, in speech expression). Somewhat surprisingly, pride and sadness were relatively better recognized than the levels attained in studies on the speaking voice (partly because the two categories were used more often by the judges). One possible explanation might be that although pride and sadness seem to be relatively rarely encountered in real life (Scherer, Wranik, Sangsue, Tran, & Scherer, 2004), and thus possibly rarely encountered in everyday speech, they are quite common in opera (e.g., proud kings or heroes and sad lovers) and are often expressed in arias. This may have led judges to use the respective categories more frequently compared with neighboring labels such as joy in the case of pride or tenderness in the case of sadness. In future work on the acoustic concomitants of pride and sadness expressions in the singing voice, researchers will need to search for the auditory cues available to listeners to detect these emotions reliably in singing (as well as their potential absence in emotion expression in the speaking voice).

The importance of identifying the acoustic cues underlying emotion recognition is illustrated by the comparisons of the performance of our judges with automatic statistical discrimination classification (using data reported in Scherer et al., in press). The discriminant analyses were performed only for the five best recognized emotions. This means that the accuracy rates of the statistical classifier were boosted by the fact that there were fewer categories to choose from and consequently less confusion. This is indeed what we found: The multiple discriminant analysis classifier outperformed the human judges precisely in the case of those emotions (sadness and tenderness) for which we found frequent confusions (with love and serenity). The classifier showed the same strong confusion patterns as the judges between sadness and
RECOGNIZING EMOTIONS IN THE SINGING VOICE 11

tenderness because of the reliance on similar acoustic parameters emotional expression on the stage (see Scherer, 2013c) and how
(low vocal energy and slow tempo; Figure 3). Similarly, the this expression potentially differs from professional mimicking
classifier was more accurate for pride, as it did not have the choice that does not involve any emotional participation. The results
of contempt, which was often confused with pride by the human obtained in our attempt to model the inference mechanism under-
judges (see Table 2). Humans also confused pride with anger more lying emotion recognition provides important insights and encour-
frequently than was the case for automatic classification based on ages further efforts in the direction of more complex research
acoustic cues. Interestingly, humans were slightly better in recog- designs, combining the study of expression and impression (rec-
nizing joy, in part because they rarely confused it with anger as the ognition) of emotion in music and particularly singing.
classifier was prone to do (because both anger and joy were
characterized by high vocal energy and fast tempo; Figures 2 and References
4). Apparently, human judges have access to some cues that
distinguish joy and anger with respect to positive and negative Balkwill, L.-L., & Thompson, W. F. (1999). A cross-cultural investigation
of the perception of emotion in music: Psychophysical and cultural cues.
valence (despite the fact that the voice is not ideally suited to
Music Perception, 17, 43 64. http://dx.doi.org/10.2307/40285811
This article is intended solely for the personal use of the individual user and is not to be disseminated broadly.

express valence). However, none of the acoustic parameters fed Balkwill, L.-L., Thompson, W. F., & Matsunaga, R. (2004). Recognition of
This document is copyrighted by the American Psychological Association or one of its allied publishers.

into the automatic classification seem to provide this information. emotion in Japanese, Western, and Hindustani music by Japanese lis-
This reflects a common problem frequently reported in the litera- teners. Japanese Psychological Research, 46, 337349. http://dx.doi
turethe difficulty in finding valid vocal cues for valence (as .org/10.1111/j.1468-5584.2004.00265.x
compared with power [vocal energy] and arousal [tempo]). The Bnziger, T., Grandjean, D., & Scherer, K. R. (2009). Emotion recognition
fact that human judges apparently use valence cues encourages from expressions in face, voice, and body: The Multimodal Emotion
further attempts to identify valence cues in work on vocal expres- Recognition Test (MERT). Emotion, 9, 691704. http://dx.doi.org/10
sion. .1037/a0017088
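To make the general logic of such a comparison concrete, the sketch below shows a cross-validated linear discriminant analysis over placeholder acoustic descriptors, with the resulting confusion matrix inspected for the category confusions discussed above. This is not the pipeline used in Scherer et al. (in press); the feature matrix, the number of descriptors, and the cross-validation scheme are assumptions made only for illustration.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_predict

# Placeholder data: rows = sung stimuli, columns = acoustic descriptors
# (e.g., energy, tempo, and F0 statistics). Real features would come from
# acoustic analysis of the recordings.
rng = np.random.default_rng(0)
labels = ["anger", "joy", "pride", "sadness", "tenderness"]  # five categories
y = np.repeat(labels, 20)
X = rng.normal(size=(len(y), 6))

# Cross-validated predictions (5-fold here) give an out-of-sample estimate
# of both overall accuracy and the pattern of confusions between categories.
clf = LinearDiscriminantAnalysis()
y_pred = cross_val_predict(clf, X, y, cv=5)

print("accuracy:", accuracy_score(y, y_pred))
print(confusion_matrix(y, y_pred, labels=labels))  # rows: true, columns: predicted
```

With real acoustic features, the off-diagonal cells of the confusion matrix can be compared directly with the confusion patterns of the human judges.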
(3) What is the nature of the process involved in inferring emotions from a singing sample? What are the major acoustic variables involved, and how is their perception mediated by perceived configurations of voice quality?
The nature of the expression and perception/inference processes underlying emotion recognition in the singing voice was the object of the third question investigated in this article. We used the TEEP model, inspired by Brunswik's lens model, to illustrate how the type of data presented here can be used to investigate these processes. However, we could only illustrate the theoretical model with correlation coefficients, rather than fitting an exact statistical model, because of the relatively low number of observations resulting from using experienced, internationally known opera singers. The three cases discussed show several examples of highly functional communication processes (allowing high accuracy) in which valid distal cues were correctly transmitted to the listener as proximal cues, and the latter were used in an appropriate fashion for the inference (mirroring the distal relationship). The dysfunctional links identified in Figures 2 to 4 concern mostly the faulty interpretation of proximal cues that do not have sufficient validity as distal indicators of the underlying emotional expression. Other possible dysfunctions include faulty transmission of distal cues to the proximal side. We strongly believe that the systematic use of the TEEP model, in cases in which a sufficient number of appropriate observations is available, should allow the establishment of the goodness of fit of theoretically predicted expression-inference relationships with techniques such as hierarchical regression or structural equation modeling (Bänziger et al., 2015).
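As a rough illustration of the kind of correlational link coefficients meant here, the sketch below computes cue validity (expression side), transmission from distal to proximal cues, cue utilization (inference side), and overall achievement as simple Pearson correlations. All data and effect sizes are simulated placeholders; the sketch illustrates the general Brunswikian logic rather than reproducing the TEEP analyses reported in this article.

```python
import numpy as np
import pandas as pd

# Simulated placeholder data: one row per sung stimulus. Column names and
# effect sizes are illustrative, not values from the study.
rng = np.random.default_rng(1)
n = 100
expressed = np.repeat([0.0, 1.0], n // 2)              # encoded emotion (e.g., low vs. high arousal)
distal = expressed + rng.normal(scale=0.7, size=n)     # measured acoustic cue (e.g., vocal energy)
proximal = distal + rng.normal(scale=0.5, size=n)      # listener rating (e.g., perceived loudness)
inference = proximal + rng.normal(scale=0.5, size=n)   # listener's emotion inference

df = pd.DataFrame({"expressed": expressed, "distal": distal,
                   "proximal": proximal, "inference": inference})

# Lens-model style link coefficients as simple Pearson correlations
links = {
    "cue validity (expressed -> distal)": df["expressed"].corr(df["distal"]),
    "transmission (distal -> proximal)": df["distal"].corr(df["proximal"]),
    "cue utilization (proximal -> inference)": df["proximal"].corr(df["inference"]),
    "achievement (expressed -> inference)": df["expressed"].corr(df["inference"]),
}
for name, r in links.items():
    print(f"{name}: r = {r:.2f}")
```

Fitting the full set of such links jointly, as suggested above, would require hierarchical regression or structural equation modeling on a sufficiently large set of observations.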
In conclusion, we hope to have shown the utility and feasibility of studying emotional communication in the singing voice, a largely neglected area of research. The results of further studies in this domain would not only enrich our general knowledge about emotion communication in the vocal channel, but would also help researchers empirically study issues in the psychology of music that have so far eluded empirical scrutiny. One example concerns the degree to which singers can successfully produce authentic emotional expression on the stage (see Scherer, 2013c) and how this expression potentially differs from professional mimicking that does not involve any emotional participation. The results obtained in our attempt to model the inference mechanism underlying emotion recognition provide important insights and encourage further efforts in the direction of more complex research designs, combining the study of expression and impression (recognition) of emotion in music and particularly singing.

References

Balkwill, L.-L., & Thompson, W. F. (1999). A cross-cultural investigation of the perception of emotion in music: Psychophysical and cultural cues. Music Perception, 17, 43–64. http://dx.doi.org/10.2307/40285811

Balkwill, L.-L., Thompson, W. F., & Matsunaga, R. (2004). Recognition of emotion in Japanese, Western, and Hindustani music by Japanese listeners. Japanese Psychological Research, 46, 337–349. http://dx.doi.org/10.1111/j.1468-5584.2004.00265.x

Bänziger, T., Grandjean, D., & Scherer, K. R. (2009). Emotion recognition from expressions in face, voice, and body: The Multimodal Emotion Recognition Test (MERT). Emotion, 9, 691–704. http://dx.doi.org/10.1037/a0017088

Bänziger, T., Hosoya, G., & Scherer, K. R. (2015). Path models of vocal emotion communication. PLoS ONE, 10, e0136675. http://dx.doi.org/10.1371/journal.pone.0136675

Bänziger, T., Mortillaro, M., & Scherer, K. R. (2012). Introducing the Geneva Multimodal Expression Corpus for experimental research on emotion perception. Emotion, 12, 1161–1179. http://dx.doi.org/10.1037/a0025827

Bänziger, T., Patel, S., & Scherer, K. R. (2014). The role of perceived voice and speech characteristics in vocal emotion communication. Journal of Nonverbal Behavior, 38, 31–52. http://dx.doi.org/10.1007/s10919-013-0165-x

Brown, S. (2000). The musilanguage model of music evolution. In N. L. Wallin, B. Merker, & S. Brown (Eds.), The origins of music (pp. 271–300). Cambridge, MA: MIT Press.

Brunswik, E. (1956). Perception and the representative design of psychological experiments. Berkeley, CA: University of California Press.

Cochrane, T., Fantini, B., & Scherer, K. R. (Eds.). (2013). The emotional power of music. Oxford, United Kingdom: Oxford University Press. http://dx.doi.org/10.1093/acprof:oso/9780199654888.001.0001

Coutinho, E., Scherer, K. R., & Dibben, N. (2014). Singing and emotion. In G. Welch, D. M. Howard, & J. Nix (Eds.), The Oxford handbook of singing (pp. 1–19). Oxford, United Kingdom: Oxford University Press.

Eyben, F., Salomão, G. L., Sundberg, J., Scherer, K. R., & Schuller, B. W. (2015). Emotion in the singing voice: A deeper look at acoustic features in the light of automatic classification. EURASIP Journal on Audio, Speech, and Music Processing, 2015. http://dx.doi.org/10.1186/s13636-015-0057-6

Eyben, F., Scherer, K. R., Schuller, B. W., Sundberg, J., André, E., Busso, C., . . . Truong, K. P. (2016). The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for voice research and affective computing. IEEE Transactions on Affective Computing, 7, 190–202. http://dx.doi.org/10.1109/TAFFC.2015.2457417

Foulds-Elliott, S. D., Thorpe, C. W., Cala, S. J., & Davis, P. J. (2000). Respiratory function in operatic singing: Effects of emotional connection. Logopedics, Phoniatrics, Vocology, 25, 151–168. http://dx.doi.org/10.1080/140154300750067539

Fritz, T., Jentschke, S., Gosselin, N., Sammler, D., Peretz, I., Turner, R., . . . Koelsch, S. (2009). Universal recognition of three basic emotions in music. Current Biology, 19, 573–576. http://dx.doi.org/10.1016/j.cub.2009.02.058

Juslin, P. N. (2000). Cue utilization in communication of emotion in music performance: Relating performance to perception. Journal of Experimental Psychology: Human Perception and Performance, 26, 1797–1812. http://dx.doi.org/10.1037/0096-1523.26.6.1797

Juslin, P. N., & Laukka, P. (2001). Impact of intended emotion intensity on cue utilization and decoding accuracy in vocal expression of emotion. Emotion, 1, 381–412. http://dx.doi.org/10.1037/1528-3542.1.4.381

Juslin, P. N., & Laukka, P. (2003). Communication of emotions in vocal expression and music performance: Different channels, same code? Psychological Bulletin, 129, 770–814. http://dx.doi.org/10.1037/0033-2909.129.5.770

Juslin, P. N., & Scherer, K. R. (2005). Vocal expression of affect. In J. A. Harrigan, R. Rosenthal, & K. Scherer (Eds.), The new handbook of methods in nonverbal behavior research (pp. 65–135). Oxford, United Kingdom: Oxford University Press.

Matsumoto, D., Keltner, D., Shiota, M. N., Frank, M. G., & O'Sullivan, M. (2008). What's in a face? Facial expressions as signals of discrete emotions. In M. Lewis, J. M. Haviland, & L. Feldman Barrett (Eds.), Handbook of emotions (pp. 211–234). New York, NY: Guilford Press.

Mithen, S. (2005). The singing Neanderthals: The origins of music, language, mind and body. London, United Kingdom: Weidenfeld and Nicolson.

Pell, M. D., & Kotz, S. A. (2011). On the time course of vocal emotion recognition. PLoS ONE, 6, e27256. http://dx.doi.org/10.1371/journal.pone.0027256

Pell, M. D., Paulmann, S., Dara, C., Alasseri, A., & Kotz, S. A. (2009). Factors in the recognition of vocally expressed emotions: A comparison of four languages. Journal of Phonetics, 37, 417–435. http://dx.doi.org/10.1016/j.wocn.2009.07.005

Scherer, K. R. (1978). Personality inference from voice quality: The loud voice of extroversion. European Journal of Social Psychology, 8, 467–487. http://dx.doi.org/10.1002/ejsp.2420080405

Scherer, K. R. (1986). Vocal affect expression: A review and a model for future research. Psychological Bulletin, 99, 143–165. http://dx.doi.org/10.1037/0033-2909.99.2.143

Scherer, K. R. (1991). Emotion expression in speech and music. In J. Sundberg, L. Nord, & R. Carlson (Eds.), Music, language, speech, and brain (pp. 146–156). Wenner-Gren Center International Symposium Series. London, United Kingdom: Macmillan.

Scherer, K. R. (1995). Expression of emotion in voice and music. Journal of Voice, 9, 235–248. http://dx.doi.org/10.1016/S0892-1997(05)80231-0

Scherer, K. R. (2003). Vocal communication of emotion: A review of research paradigms. Speech Communication, 40, 227–256. http://dx.doi.org/10.1016/S0167-6393(02)00084-5

Scherer, K. R. (2013a). Affect bursts as evolutionary precursors of speech and music. In G. A. Danieli, A. Minelli, & T. Pievani (Eds.), Stephen J. Gould: The scientific legacy (pp. 147–167). Milan, Italy: Springer-Verlag. http://dx.doi.org/10.1007/978-88-470-5424-0_10

Scherer, K. R. (2013b). Emotion in action, interaction, music, and speech. In M. A. Arbib (Ed.), Language, music, and the brain: A mysterious relationship (pp. 107–140). Cambridge, MA: MIT Press. http://dx.doi.org/10.7551/mitpress/9780262018104.003.0005

Scherer, K. R. (2013c). The singer's paradox: On authenticity in emotional expression on the opera stage. In T. Cochrane, B. Fantini, & K. R. Scherer (Eds.), The emotional power of music (pp. 55–73). Oxford, United Kingdom: Oxford University Press. http://dx.doi.org/10.1093/acprof:oso/9780199654888.003.0005

Scherer, K. R., Clark-Polner, E., & Mortillaro, M. (2011). In the eye of the beholder? Universality and cultural specificity in the expression and perception of emotion. International Journal of Psychology, 46, 401–435. http://dx.doi.org/10.1080/00207594.2011.626049

Scherer, K. R., Johnstone, T., & Klasmeyer, G. (2003). Vocal expression of emotion. In R. J. Davidson, K. R. Scherer, & H. Goldsmith (Eds.), Handbook of the affective sciences (pp. 433–456). New York, NY: Oxford University Press.

Scherer, K. R., Sundberg, J., Fantini, B., Trznadel, S., & Eyben, F. (in press). The expression of emotion in the singing voice: Acoustic patterns in vocal performance. Journal of the Acoustical Society of America.

Scherer, K. R., Sundberg, J., Tamarit, L., & Salomão, G. L. (2015). Comparing the acoustic expression of emotion in the speaking and the singing voice. Computer Speech & Language, 29, 218–235. http://dx.doi.org/10.1016/j.csl.2013.10.002

Scherer, K. R., Wranik, T., Sangsue, J., Tran, V., & Scherer, U. (2004). Emotions in everyday life: Probability of occurrence, risk factors, appraisal and reaction pattern. Social Science Information, 43, 499–570. http://dx.doi.org/10.1177/0539018404047701

Sherman, M. (1928). Emotional character of the singing voice. Journal of Experimental Psychology, 11, 495–497. http://dx.doi.org/10.1037/h0075703

Sundberg, J., Lã, F. M., & Himonides, E. (2013). Intonation and expressivity: A single case study of classical western singing. Journal of Voice, 27, 391.e1–391.e8. http://dx.doi.org/10.1016/j.jvoice.2012.11.009

Received February 20, 2017
Revision received July 14, 2017
Accepted August 20, 2017
