
1 The Perception of Musical Tones

Andrew J. Oxenham
Department of Psychology, University of Minnesota, Minneapolis

I. Introduction

A. What Are Musical Tones?


The definition of a tone (a periodic sound that elicits a pitch sensation) encompasses the vast majority of musical sounds. Tones can be either pure (sinusoidal variations in air pressure at a single frequency) or complex. Complex tones can be divided into two categories, harmonic and inharmonic. Harmonic complex tones are periodic, with a repetition rate known as the fundamental frequency (F0), and are composed of a sum of sinusoids with frequencies that are all integer multiples, or harmonics, of the F0. Inharmonic complex tones are composed of multiple sinusoids whose frequencies are not simple integer multiples of any common F0. Most musical instrumental or vocal tones are more or less harmonic, but some, such as bell chimes, can be inharmonic.

B. Measuring Perception
The physical attributes of a sound, such as its intensity and spectral content, can be readily measured with modern technical instrumentation. Measuring the perception of sound is a different matter. Gustav Fechner, a 19th-century German scientist, is credited with founding the field of psychophysics: the attempt to establish a quantitative relationship between physical variables (e.g., sound intensity and frequency) and the sensations they produce (e.g., loudness and pitch; Fechner, 1860). The psychophysical techniques that have been developed since Fechner's time to tap into our perceptions and sensations (involving hearing, vision, smell, touch, and taste) can be loosely divided into two categories of measures, subjective and objective. The subjective measures typically require participants to estimate or produce magnitudes or ratios that relate to the dimension under study. For instance, in establishing a loudness scale, participants may be presented with a series of tones at different intensities and then asked to assign a number to each tone, corresponding to its loudness. This method of magnitude estimation thus produces a psychophysical function that directly relates loudness to sound intensity. Ratio estimation follows the same principle, except that participants may be presented with two
sounds and then asked to judge how much louder (e.g., twice or three times) one sound is than the other. The complementary methods are magnitude production and ratio production. In these production techniques, the participants are required to vary the relevant physical dimension of a sound until it matches a given magnitude (number), or until it matches a specific ratio with respect to a reference sound. In the latter case, the instructions may be something like "adjust the level of the second sound until it is twice as loud as the first sound." All four techniques have been employed numerous times in attempts to derive appropriate psychophysical scales (e.g., Buus, Muesch, & Florentine, 1998; Hellman, 1976; Hellman & Zwislocki, 1964; Stevens, 1957; Warren, 1970). Other variations on these methods include categorical scaling and cross-modality matching. Categorical scaling involves asking participants to assign the auditory sensation to one of a number of fixed categories; following our loudness example, participants might be asked to select a category ranging from "very quiet" to "very loud" (e.g., Mauermann, Long, & Kollmeier, 2004). Cross-modality matching avoids the use of numbers by, for instance, asking participants to adjust the length of a line, or a piece of string, to match the perceived loudness of a tone (e.g., Epstein & Florentine, 2005). Although all these methods have the advantage of providing a more-or-less direct estimate of the relationship between the physical stimulus and the sensation, they also have a number of disadvantages. First, they are subjective and rely on introspection on the part of the subject. Perhaps because of this, they can be somewhat unreliable, variable across and within participants, and prone to various biases (e.g., Poulton, 1977). The other approach is to use an objective measure, where a right and wrong answer can be verified externally. This approach usually involves probing the limits of resolution of the sensory system, by measuring absolute threshold (the smallest detectable stimulus), relative threshold (the smallest detectable change in a stimulus), or masked threshold (the smallest detectable stimulus in the presence of another stimulus). There are various ways of measuring threshold, but most involve a forced-choice procedure, where the subject has to pick the interval that contains the target sound from a selection of two or more. For instance, in an experiment measuring absolute threshold, the subject might be presented with two successive time intervals, marked by lights; the target sound is played during one of the intervals, and the subject has to decide which one it was. One would expect performance to change with the intensity of the sound: at very low intensities, the sound will be completely inaudible, and so performance will be at chance (50% correct in a two-interval task); at very high intensities, the sound will always be clearly audible, so performance will be near 100%, assuming that the subject continues to pay attention. A psychometric function can then be derived, which plots the performance of a subject as a function of the stimulus parameter. An example of a psychometric function is shown in Figure 1, which plots percent correct as a function of sound pressure level. This type of forced-choice paradigm is usually preferable (although often more time-consuming) to more subjective measures, such as the method of limits, which is often used today to measure audiograms.
In the method of limits, the intensity of a sound is decreased until the subject reports no longer being able to hear it, and then the intensity of the sound is increased until the subject again reports being able to hear it.


Figure 1 A schematic example of a psychometric function, plotting percent correct in a two-alternative forced-choice task against the sound pressure level of a test tone.
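To make the notion of a psychometric function concrete, the following short simulation sketch (not taken from the chapter; the underlying threshold of 5 dB SPL and the slope of the function are arbitrary assumptions) generates the percent-correct scores that a two-interval forced-choice detection experiment might produce at a range of signal levels, tracing out a curve like the one in Figure 1.

# Illustrative sketch only: simulated percent correct in a two-interval
# forced-choice detection task. The "true" threshold and slope are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def prob_correct(level_db, threshold_db=5.0, slope=1.5):
    # Logistic function rising from chance (0.5) to 1.0 as the signal level increases.
    return 0.5 + 0.5 / (1.0 + np.exp(-(level_db - threshold_db) / slope))

levels = np.arange(-5.0, 17.5, 2.5)   # signal levels in dB SPL
n_trials = 200                        # simulated trials per level

for level in levels:
    correct = rng.random(n_trials) < prob_correct(level)
    print(f"{level:5.1f} dB SPL: {100 * correct.mean():5.1f}% correct")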

The trouble with such measures is that they rely not just on sensitivity but also on criterion: how willing the subject is to report having heard a sound if he or she is not sure. A forced-choice procedure eliminates that problem by forcing participants to guess, even if they are unsure which interval contained the target sound. Clearly, testing the perceptual limits by measuring thresholds does not tell us everything about human auditory perception; a primary concern is that these measures are typically indirect: the finding that people can detect less than a 1% change in frequency does not tell us much about the perception of much larger musical intervals, such as an octave. Nevertheless, it has proved extremely useful in helping us to gain a deeper understanding of perception and its relation to the underlying physiology of the ear and brain. Measures of reaction time, or response time (RT), have also been used to probe sensory processing. The two basic forms of response time are simple response time (SRT), where participants are instructed to respond as quickly as possible by pushing a single button once a stimulus is presented, and choice response time (CRT), where participants have to categorize the stimulus (usually into one of two categories) before responding (by pressing button 1 or 2). Although RT measures are more common in cognitive tasks, they also depend on some basic sound attributes, such as sound intensity, with higher intensity sounds eliciting faster reactions, measured using both SRTs (Kohfeld, 1971; Luce & Green, 1972) and CRTs (Keuss & van der Molen, 1982). Finally, measures of perception are not limited to the quantitative or numerical domain. It is also possible to ask participants to describe their percepts in words. This approach has clear applications when dealing with multidimensional attributes, such as timbre (see below, and Chapter 2 of this volume), but also has some inherent difficulties, as different people may use descriptive words in different ways. To sum up, measuring perception is a thorny issue that has many solutions, all with their own advantages and shortcomings. Perceptual measures remain a crucial systems-level analysis tool that can be combined in both human and animal studies with various physiological and neuroimaging techniques, to help us discover more about how the ears and brain process musical sounds in ways that elicit music's powerful cognitive and emotional effects.


II. Perception of Single Tones

Although a single tone is a far cry from the complex combinations of sound that make up most music, it can be a useful place to start in order to make sense of how music is perceived and represented in the auditory system. The sensation produced by a single tone is typically divided into three categories: loudness, pitch, and timbre.

A. Loudness
The most obvious physical correlate of loudness is sound intensity (or sound pressure) measured at the eardrum. However, many other factors also influence the loudness of a sound, including its spectral content, its duration, and the context in which it is presented.

1. Dynamic Range and the Decibel


The human auditory system has an enormous dynamic range, with the lowest-intensity sound that is audible being about a factor of 1,000,000,000,000 less intense than the loudest sound that does not cause immediate hearing damage. This very large range is one reason why a logarithmic scale, the decibel or dB, is used to describe sound level. In these units, the dynamic range of hearing corresponds to about 120 dB. Sound intensity is proportional to the square of sound pressure, which is often described in terms of sound pressure level (SPL), using a pressure, P0, of 2 × 10^-5 N/m^2, or 20 μPa (micropascals), as the reference, which is close to the average absolute threshold for medium-frequency pure tones in young normal-hearing individuals. The SPL of a given sound pressure, P1, is then defined as 20 log10(P1/P0). A similar relationship exists between sound intensity and sound level, such that the level is given by 10 log10(I1/I0). (The multiplier is now 10 instead of 20 because of the square-law relationship between intensity and pressure.) Thus, a sound level in decibels is always a ratio and not an absolute value. The dynamic range of music depends on the music style. Modern classical music can have a very large dynamic range, from pianissimo passages on a solo instrument (roughly 45 dB SPL) to a full orchestra playing fortissimo (about 95 dB SPL), as measured in concert halls (Winckel, 1962). Pop music, which is often listened to in less-than-ideal conditions, such as in a car or on a street, generally has a much smaller dynamic range. Radio broadcast stations typically reduce the dynamic range even further using compression to make their signal as consistently loud as possible without exceeding the maximum peak amplitude of the broadcast channel, so that the end dynamic range is rarely more than about 10 dB. Our ability to discriminate small changes in level has been studied in great depth for a wide variety of sounds and conditions (e.g., Durlach & Braida, 1969; Jesteadt, Wier, & Green, 1977; Viemeister, 1983). As a rule of thumb, we are able to discriminate changes on the order of 1 dB, corresponding to a change in sound pressure of about 12%.


The fact that the size of the just-noticeable difference (JND) of broadband sounds remains roughly constant when expressed as a ratio or in decibels is in line with the well-known Weber's law, which states that the JND between two stimuli is proportional to the magnitude of the stimuli. In contrast to our ability to judge differences in sound level between two sounds presented one after another, our ability to categorize or label sound levels is rather poor. In line with Miller's (1956) famous "7 plus or minus 2" postulate for information processing and categorization, our ability to categorize sound levels accurately is fairly limited and is subject to a variety of influences, such as the context of the preceding sounds. This may explain why the musical notation of loudness (in contrast to pitch) has relatively few categories between pianissimo and fortissimo: typically just six (pp, p, mp, mf, f, and ff).
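The decibel relationships just described can be checked with a few lines of arithmetic; the sketch below (purely illustrative) reproduces the 120-dB dynamic range quoted above and the roughly 12% change in sound pressure that corresponds to a 1-dB step.

# Sketch of the decibel arithmetic described in the text.
import math

P0 = 20e-6   # reference pressure of 20 micropascals, approximately the threshold of hearing

def spl_from_pressure(p1):
    return 20 * math.log10(p1 / P0)

def level_from_intensity_ratio(i1_over_i0):
    return 10 * math.log10(i1_over_i0)

print(spl_from_pressure(20e-6))          # 0.0 dB SPL at the reference pressure
print(level_from_intensity_ratio(1e12))  # a 10^12 intensity ratio corresponds to 120 dB
print(100 * (10 ** (1 / 20) - 1))        # a 1-dB step is a change of about 12.2% in pressure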

2. Equal Loudness Contours and the Loudness Weighting Curves


There is no direct relationship between the physical sound level (in dB SPL) and the sensation of loudness. There are many reasons for this, but an important one is that loudness depends heavily on the frequency content of the sound. Figure 2 shows what are known as equal loudness contours. The basic concept is that two pure tones with different frequencies, but with levels that fall on the same loudness contour, have the same loudness. For instance, as shown in Figure 2, a pure tone with a frequency of 1 kHz and a level of 40 dB SPL has the same loudness as a pure tone with a frequency of 100 Hz and a level of about 64 dB SPL.
[Figure 2 plots sound pressure level in dB against frequency in Hz, showing equal-loudness contours from the hearing threshold up to 100 phons.]

Figure 2 The equal-loudness contours, taken from ISO 226:2003. Original figure kindly provided by Brian C. J. Moore.


In other words, a 100-Hz tone has to be 24 dB higher in level than a 40-dB SPL 1-kHz tone in order to be perceived as being equally loud. The equal loudness contours are incorporated into an international standard (ISO 226) that was initially established in 1961 and was last revised in 2003. These equal loudness contours have been derived several times from painstaking psychophysical measurements, not always with identical outcomes (Fletcher & Munson, 1933; Robinson & Dadson, 1956; Suzuki & Takeshima, 2004). The measurements typically involve either loudness matching, where a subject adjusts the level of one tone until it sounds as loud as a second tone, or loudness comparisons, where a subject compares the loudness of many pairs of tones and the results are compiled to derive points of subjective equality (PSE). Both methods are highly susceptible to nonsensory biases, making the task of deriving a definitive set of equal loudness contours a challenging one (Gabriel, Kollmeier, & Mellert, 1997). The equal loudness contours provide the basis for the measure of loudness level, which has units of phons. The phon value of a sound is the dB SPL value of a 1-kHz tone that is judged to have the same loudness as the sound. So, by definition, a 40-dB SPL tone at 1 kHz has a loudness level of 40 phons. Continuing the preceding example, the 100-Hz tone at a level of about 64 dB SPL also has a loudness level of 40 phons, because it falls on the same equal loudness contour as the 40-dB SPL 1-kHz tone. Thus, the equal loudness contours can also be termed the equal phon contours. Although the actual measurements are difficult, and the results somewhat contentious, there are many practical uses for the equal loudness contours. For instance, in issues of community noise annoyance from rock concerts or airports, it is more useful to know about the perceived loudness of the sounds in question, rather than just their physical level. For this reason, an approximation of the 40-phon equal loudness contour is built into most modern sound level meters and is referred to as the A-weighted curve. A sound level that is quoted in dB (A) is an overall sound level that has been filtered with the inverse of the approximate 40-phon curve. This means that very low and very high frequencies, which are perceived as being less loud, are given less weight than the middle of the frequency range. As with all useful tools, the A-weighted curve can be misused. Because it is based on the 40-phon curve, it is most suitable for low-level sounds; however, that has not prevented it from being used in measurements of much higher-level sounds, where a flatter filter would be more appropriate, such as that provided by the much-less-used C-weighted curve. The ubiquitous use of the dB (A) scale for all levels of sound therefore provides an example of a case where the convenience of a single-number measure (and one that minimizes the impact of difficult-to-control low frequencies) has outweighed the desire for accuracy.
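The chapter does not reproduce the A-weighting filter itself, but a widely used closed-form approximation (the one given in IEC 61672) is easy to evaluate; the sketch below simply applies that published formula at a few frequencies to show how low and very high frequencies are down-weighted.

# Sketch: the closed-form A-weighting approximation from IEC 61672.
import math

def a_weighting_db(f):
    # Frequency response R_A(f), normalized so that the weight at 1 kHz is about 0 dB.
    ra = (12194.0**2 * f**4) / (
        (f**2 + 20.6**2)
        * math.sqrt((f**2 + 107.7**2) * (f**2 + 737.9**2))
        * (f**2 + 12194.0**2)
    )
    return 20 * math.log10(ra) + 2.00

for f in (100, 1000, 4000, 10000):
    print(f"{f:6d} Hz: {a_weighting_db(f):6.1f} dB")
# Low and very high frequencies get negative weights (they count for less),
# mirroring their reduced loudness at low presentation levels.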

3. Loudness Scales
Equal loudness contours and phons tell us about the relationship between loudness and frequency. They do not, however, tell us about the relationship between loudness and sound level. For instance, the phon, based as it is on the decibel scale at 1 kHz, says nothing about how much louder a 60-dB SPL tone is than a 30-dB SPL tone.


The answer, according to numerous studies of loudness, is not "twice as loud." There have been numerous attempts since Fechner's day to relate the physical sound level to loudness. Fechner (1860), building on Weber's law, reasoned that if JNDs were constant on a logarithmic scale, and if equal numbers of JNDs reflected an equal change in loudness, then loudness must be related logarithmically to sound intensity. Harvard psychophysicist S. S. Stevens disagreed, claiming that JNDs reflected noise in the auditory system, which did not provide direct insight into the function relating loudness to sound intensity (Stevens, 1957). Stevens's approach was to use magnitude and ratio estimation and production techniques, as described in Section I of this chapter, to derive a relationship between loudness and sound intensity. He concluded that loudness (L) was related to sound intensity (I) by a power law:

L = kI^α    (Eq. 1)

where the exponent, α, has a value of about 0.3 at medium frequencies and for moderate and higher sound levels. This law implies that a 10-dB increase in level results in a doubling of loudness. At low levels, and at lower frequencies, the exponent is typically larger, leading to a steeper growth-of-loudness function. Stevens used this relationship to derive loudness units, called sones. By definition, 1 sone is the loudness of a 1-kHz tone presented at a level of 40 dB SPL; 2 sones is twice as loud, corresponding roughly to a 1-kHz tone presented at 50 dB SPL, and 4 sones corresponds to the same tone at about 60 dB SPL. Numerous studies have supported the basic conclusion that loudness can be related to sound intensity by a power law. However, in part because of the variability of loudness judgments, and the substantial effects of experimental methodology (Poulton, 1979), different researchers have found different values for the best-fitting exponent. For instance, Warren (1970) argued that presenting participants with several sounds to judge invariably results in bias. He therefore presented each subject with only one trial. Based on these single-trial judgments, Warren also derived a power law, but he found an exponent value of 0.5. This exponent value is what one might expect if the loudness of a sound were inversely proportional to its distance from the receiver, given the 6-dB decrease in level for every doubling of distance. Yet another study, which tried to avoid bias effects by using the entire (100-dB) level range within each experiment, derived an exponent of only 0.1, implying a doubling of loudness for every 30-dB increase in sound level (Viemeister & Bacon, 1988). Overall, it is generally well accepted that the relationship between loudness and sound intensity can be approximated as a power law, although methodological issues and intersubject and intrasubject variability have made it difficult to derive a definitive and uncontroversial function relating the sensation to the physical variable.
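For a 1-kHz tone, where the loudness level in phons equals the level in dB SPL, the sone scale described above reduces to a simple rule of thumb: loudness doubles for every 10-phon increase above 40 phons. The sketch below (illustrative only, and valid roughly for moderate levels and medium frequencies) expresses that relationship.

# Sketch of the sone scale implied by Stevens's power law with an exponent near 0.3.
def sones_from_phons(loudness_level_phons):
    # By definition, 40 phons = 1 sone; loudness roughly doubles per 10-phon increase.
    return 2 ** ((loudness_level_phons - 40) / 10)

for phons in (40, 50, 60, 70):
    print(f"{phons} phons -> {sones_from_phons(phons):.1f} sones")
# 40 -> 1.0, 50 -> 2.0, 60 -> 4.0, 70 -> 8.0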

4. Partial Loudness and Context Effects


Most sounds that we encounter, particularly in music, are accompanied by other sounds.


This fact makes it important to understand how the loudness of a sound is affected by the context in which it is presented. In this section, we deal with two such situations, the first being when sounds are presented simultaneously, the second when they are presented sequentially. When two sounds are presented together, as in the case of two musical instruments or voices, they may partially mask each other, and the loudness of each may not be as great as if each sound were presented in isolation. The loudness of a partially masked sound is termed partial loudness (Moore, Glasberg, & Baer, 1997; Scharf, 1964; Zwicker, 1963). When a sound is completely masked by another, its loudness is zero, or a very small quantity. As its level is increased to above its masked threshold, it becomes audible, but its loudness is low, similar to that of the same sound presented in isolation but just a few decibels above its absolute threshold. As the level is increased further, the sound's loudness increases rapidly, essentially catching up with its unmasked loudness once it is about 20 dB or more above its masked threshold. The loudness of a sound is also affected by the sounds that precede it. In some cases, loud sounds can enhance the loudness of immediately subsequent sounds (e.g., Galambos, Bauer, Picton, Squires, & Squires, 1972; Plack, 1996); in other cases, the loudness of the subsequent sounds can be reduced (Mapes-Riordan & Yost, 1999; Marks, 1994). There is still some debate as to whether separate mechanisms are required to explain these two phenomena (Arieh & Marks, 2003b; Oberfeld, 2007; Scharf, Buus, & Nieder, 2002). Initially, it was not clear whether the phenomenon of loudness recalibration (a reduction in the loudness of moderate-level sounds following a louder one) reflected a change in the way participants assigned numbers to the perceived loudness, or reflected a true change in the loudness sensation (Marks, 1994). However, more recent work has shown that choice response times to recalibrated stimuli change in a way that is consistent with physical changes in the intensity, suggesting a true sensory phenomenon (Arieh & Marks, 2003a).

5. Models of Loudness
Despite the inherent difficulties in measuring loudness, a model that can predict the loudness of arbitrary sounds is still a useful tool. The development of models of loudness perception has a long history (Fletcher & Munson, 1937; Moore & Glasberg, 1996, 1997; Moore et al., 1997; Moore, Glasberg, & Vickers, 1999; Zwicker, 1960; Zwicker, Fastl, & Dallmayr, 1984). Essentially all are based on the idea that the loudness of a sound reflects the amount of excitation it produces within the auditory system. Although a direct physiological test, comparing the total amount of auditory nerve activity in an animal model with the predicted loudness based on human studies, did not find a good correspondence between the two (Relkin & Doucet, 1997), the psychophysical models that relate predicted excitation patterns, based on auditory filtering and cochlear nonlinearity, to loudness generally provide accurate predictions of loudness in a wide variety of conditions (e.g., Chen, Hu, Glasberg, & Moore, 2011). Some models incorporate partial loudness predictions (Chen et al., 2011; Moore et al., 1997), others predict the effects of cochlear hearing loss on loudness


(Moore & Glasberg, 1997), and others have been extended to explain the loudness of sounds that fluctuate over time (Chalupper & Fastl, 2002; Glasberg & Moore, 2002). However, none has yet attempted to incorporate context effects, such as loudness recalibration or loudness enhancement.

B. Pitch
Pitch is arguably the most important dimension for conveying music. Sequences of pitches form a melody, and simultaneous combinations of pitches form harmony: two foundations of Western music. There is a vast body of literature devoted to pitch research, from both perceptual and neural perspectives (Plack, Oxenham, Popper, & Fay, 2005). The clearest physical correlate of pitch is the periodicity, or repetition rate, of a sound, although other dimensions, such as sound intensity, can have small effects (e.g., Verschuure & van Meeteren, 1975). For young people with normal hearing, pure tones with frequencies between about 20 Hz and 20 kHz are audible. However, only sounds with repetition rates between about 30 Hz and 5 kHz elicit a pitch percept that can be called musical and is strong enough to carry a melody (e.g., Attneave & Olson, 1971; Pressnitzer, Patterson, & Krumbholz, 2001; Ritsma, 1962). Perhaps not surprisingly, these limits, which were determined through psychoacoustical investigation, correspond quite well to the lower and upper limits of pitch found on musical instruments: the lowest and highest notes of a modern grand piano, which covers the ranges of all standard orchestral instruments, correspond to 27.5 Hz and 4186 Hz, respectively. We tend to recognize patterns of pitches that form melodies (see Chapter 7 of this volume). We do this presumably by recognizing the musical intervals between successive notes (see Chapters 4 and 7 of this volume), and most of us seem relatively insensitive to the absolute pitch values of the individual notes, so long as the pitch relationships between notes are correct. However, exactly how the pitch is extracted from each note and how it is represented in the auditory system remain unclear, despite many decades of intense research.
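The piano-range figures quoted above follow directly from equal temperament; a minimal check (assuming the standard 88-key numbering, with A4 = 440 Hz as key 49) is shown below.

# Sketch: equal-tempered frequencies of the lowest and highest keys of an 88-key piano.
def key_frequency(key_number, reference_key=49, reference_hz=440.0):
    # Each key is a semitone step, i.e., a frequency ratio of 2**(1/12).
    return reference_hz * 2 ** ((key_number - reference_key) / 12)

print(key_frequency(1))    # lowest key (A0): 27.5 Hz
print(key_frequency(88))   # highest key (C8): about 4186 Hz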

1. Pitch of Pure Tones


Pure tones produce a clear, unambiguous pitch, and we are very sensitive to changes in their frequency. For instance, well-trained listeners can distinguish between two tones with frequencies of 1000 and 1002 Hz, a difference of only 0.2% (Moore, 1973). A semitone, the smallest step in the Western scale system, is a difference of about 6%, or about a factor of 30 greater than the JND of frequency for pure tones. Perhaps not surprisingly, musicians are generally better than nonmusicians at discriminating small changes in frequency; what is more surprising is that it does not take much practice for people with no musical training to catch up with musicians in terms of their performance. In a recent study, frequency discrimination abilities of trained classical musicians were compared with those of untrained listeners with no musical background, using both pure tones and complex tones (Micheyl, Delhommeau, Perrot, & Oxenham, 2006). Initially, thresholds were about a factor of 6 worse for the untrained listeners.


However, it took only between 4 and 8 hours of practice for the thresholds of the untrained listeners to match those of the trained musicians, whereas the trained musicians did not improve with practice. This suggests that most people are able to discriminate very fine differences in frequency with very little in the way of specialized training. Two representations of a pure tone at 440 Hz (the orchestral A) are shown in Figure 3. The upper panel shows the waveform (variations in sound pressure as a function of time), which repeats 440 times a second and so has a period of 1/440 s, or about 2.27 ms. The lower panel provides the spectral representation, showing that the sound has energy only at 440 Hz. This spectral representation is for an ideal pure tone, one that has no beginning or end. In practice, spectral energy spreads above and below the frequency of the pure tone, reflecting the effects of onset and offset. These two representations (spectral and temporal) provide a good introduction to two ways in which pure tones are represented in the peripheral auditory system.
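A minimal synthesis sketch (the sampling rate and duration are arbitrary choices) reproduces the two representations shown in Figure 3: a 440-Hz sinusoid with a period of about 2.27 ms, and a spectrum whose energy is concentrated at 440 Hz but spreads slightly because the tone has a finite duration.

# Sketch: time and frequency representations of a 440-Hz pure tone (cf. Figure 3).
import numpy as np

fs = 44100                                 # sampling rate (Hz), an arbitrary choice
t = np.arange(int(fs * 0.5)) / fs          # 0.5 s of samples
tone = np.sin(2 * np.pi * 440 * t)

print(1000 / 440)                          # period in milliseconds, about 2.27

spectrum = np.abs(np.fft.rfft(tone * np.hanning(len(tone))))
freqs = np.fft.rfftfreq(len(tone), 1 / fs)
print(freqs[np.argmax(spectrum)])          # spectral peak close to 440 Hz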

Figure 3 Schematic diagram of the time waveform (upper panel) and power spectrum (lower panel) of a pure tone with a frequency of 440 Hz.

The first potential code, known as the place code, reflects the mechanical filtering that takes place in the cochlea of the inner ear. The basilar membrane, which runs the length of the fluid-filled cochlea from the base to the apex, vibrates in response to sound. The responses of the basilar membrane are sharply tuned and highly specific: a certain frequency will cause only a local region of the basilar membrane to vibrate. Because of its structural properties, the apical end of the basilar membrane responds best to low frequencies, while the basal end responds best to high frequencies. Thus, every place along the basilar membrane has its own best frequency, or characteristic frequency (CF): the frequency to which that place responds most strongly. This frequency-to-place mapping, or tonotopic organization, is maintained throughout the auditory pathways up to primary auditory cortex, thereby providing a potential neural code for the pitch of pure tones. The second potential code, known as the temporal code, relies on the fact that action potentials, or spikes, generated in the auditory nerve tend to occur at a certain phase within the period of a sinusoid. This property, known as phase locking, means that the brain could potentially represent the frequency of a pure tone by way of the time intervals between spikes, when pooled across the auditory nerve. No data are available from the human auditory nerve, because of the invasive nature of the measurements, but phase locking has been found to extend to between 2 and 4 kHz in other mammals, depending somewhat on the species. Unlike tonotopic organization, phase locking up to high frequencies is not preserved in higher stations of the auditory pathways. At the level of the auditory cortex, the limit of phase locking reduces to at best 100 to 200 Hz (Wallace, Rutkowski, Shackleton, & Palmer, 2000). Therefore, most researchers believe that the timing code found in the auditory nerve must be transformed to some form of place or population code at a relatively early stage of auditory processing. There is some psychoacoustical evidence for both place and temporal codes. One piece of evidence in favor of a temporal code is that pitch discrimination abilities deteriorate at high frequencies: the JND between two frequencies becomes considerably larger at frequencies above about 4 to 5 kHz, the same frequency range above which listeners' ability to recognize familiar melodies (Attneave & Olson, 1971), or to notice subtle changes in unfamiliar melodies (Oxenham, Micheyl, Keebler, Loper, & Santurette, 2011), degrades. This frequency is similar to the one just described in which phase locking in the auditory nerve is strongly degraded (e.g., Palmer & Russell, 1986; Rose, Brugge, Anderson, & Hind, 1967), suggesting that the temporal code is necessary for accurate pitch discrimination and for melody perception. It might even be taken as evidence that the upper pitch limits of musical instruments were determined by the basic physiological limits of the auditory nerve. Evidence for the importance of place information comes first from the fact that some form of pitch perception remains possible even with pure tones of very high frequency (Henning, 1966; Moore, 1973), where it is unlikely that phase locking information is useful (e.g., Palmer & Russell, 1986). Another line of evidence indicating that place information may be important comes from a study that used so-called transposed tones (van de Par & Kohlrausch, 1997) to present the temporal information that would normally be available only to a low-frequency region in the cochlea to a high-frequency region, thereby dissociating temporal from place cues (Oxenham, Bernstein, & Penagos, 2004). In that study, pitch discrimination was considerably worse when the low-frequency temporal information was presented to the wrong place in the cochlea, suggesting that place information is important.


In light of this mixed evidence, it may be safest to assume that the auditory system uses both place and timing information from the auditory nerve in order to extract the pitch of pure tones. Indeed, some theories of pitch explicitly require both accurate place and timing information (Loeb, White, & Merzenich, 1983). Gaining a better understanding of how the information is extracted remains an important research goal. The question is of particular clinical relevance, as deficits in pitch perception are a common complaint of people with hearing loss and people with cochlear implants. A clearer understanding of how the brain uses information from the cochlea will help researchers to improve the way in which auditory prostheses, such as hearing aids and cochlear implants, present sound to their users.

2. Pitch of Complex Tones


A large majority of musical sounds are complex tones of one form or another, and most have a pitch associated with them. Most common are harmonic complex tones, which are composed of the F0 (corresponding to the repetition rate of the entire waveform) and upper partials, harmonics, or overtones, spaced at integer multiples of the F0. The pitch of a harmonic complex tone usually corresponds to the F0. In other words, if a subject is asked to match the pitch of a complex tone to the pitch of a single pure tone, the best match usually occurs when the frequency of the pure tone is the same as the F0 of the complex tone. Interestingly, this is true even when the complex tone has no energy at the F0 or the F0 is masked (de Boer, 1956; Licklider, 1951; Schouten, 1940; Seebeck, 1841). This phenomenon has been given various terms, including pitch of the missing fundamental, periodicity pitch, residue pitch, and virtual pitch. The ability of the auditory system to extract the F0 of a sound is important from the perspective of perceptual constancy: imagine a violin note being played in a quiet room and then again in a room with a noisy air-conditioning system. The low-frequency noise of the air-conditioning system might well mask some of the lower-frequency energy of the violin, including the F0, but we would not expect the pitch (or identity) of the violin to change because of it. Although the ability to extract the periodicity pitch is clearly an important one, and one that is shared by many different species (Shofner, 2005), exactly how the auditory system extracts the F0 remains for the most part unknown.
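The missing fundamental is easy to demonstrate numerically. The sketch below (an illustration, with arbitrary parameter choices) builds a complex tone from harmonics 2 through 10 of a 220-Hz F0, so that there is no energy at 220 Hz, and then recovers a repetition rate of about 220 Hz from the waveform's autocorrelation.

# Sketch: a "missing fundamental" complex still repeats at the fundamental period.
import numpy as np

fs = 44100
t = np.arange(int(fs * 0.2)) / fs
f0 = 220.0

# Harmonics 2-10 only: no component at the 220-Hz fundamental itself.
complex_tone = sum(np.sin(2 * np.pi * h * f0 * t) for h in range(2, 11))

# Estimate the repetition rate from the first major autocorrelation peak.
ac = np.correlate(complex_tone, complex_tone, mode="full")[len(complex_tone) - 1:]
min_lag = int(fs / 1000)                   # ignore lags shorter than 1 ms
best_lag = min_lag + np.argmax(ac[min_lag:])
print(fs / best_lag)                       # close to 220 Hz despite no 220-Hz energy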

Figure 4 Representations of a harmonic complex tone with a fundamental frequency (F0) of 440 Hz. The upper panel shows the time waveform. The second panel shows the power spectrum of the same waveform. The third panel shows the auditory filter bank, representing the filtering that occurs in the cochlea. The fourth panel shows the excitation pattern, or the time-averaged output of the filter bank. The fifth panel shows some sample time waveforms at the output of the filter bank, including filters centered at the F0 and the fourth harmonic, illustrating resolved harmonics, and filters centered at the 8th and 12th harmonic of the complex, illustrating harmonics that are less well resolved and show amplitude modulations at a rate corresponding to the F0.

The initial stages in processing a harmonic complex tone are shown in Figure 4. The upper two panels show the time waveform and the spectral representation of a harmonic complex tone. The third panel depicts the filtering that occurs in the cochlea: each point along the basilar membrane can be represented as a band-pass filter that responds to only those frequencies close to its center frequency. The fourth panel shows the excitation pattern produced by the sound. This is the average response of the bank of band-pass filters, plotted as a function of the filters' center frequencies (Glasberg & Moore, 1990). The fifth panel shows an excerpt of the time waveform at the output of some of the filters along the array. This is an approximation of the waveform that drives the inner hair cells in the cochlea, which in turn synapse with the auditory nerve fibers to produce the spike trains that the brain must interpret. Considering the lower two panels of Figure 4, it is possible to see a transition as one moves from the low-numbered harmonics on the left to the high-numbered harmonics on the right: The first few harmonics generate distinct peaks in the excitation pattern, because the filters in that frequency region are narrower than the spacing between successive harmonics. Note also that the time waveforms at the outputs of filters centered at the low-numbered harmonics resemble pure tones. At higher harmonic numbers, the bandwidths of the auditory filters become wider than the spacing between successive harmonics, and so individual peaks in the excitation pattern are lost. Similarly, the time waveform at the output of higher-frequency filters no longer resembles a pure tone, but instead reflects the interaction of multiple harmonics, producing a complex waveform that repeats at a rate corresponding to the F0. Harmonics that produce distinct peaks in the excitation pattern and/or produce quasi-sinusoidal vibrations on the basilar membrane are referred to as being resolved. Phenomenologically, resolved harmonics are those that can be "heard out" as separate tones under certain circumstances. Typically, we do not hear the individual harmonics when we listen to a musical tone, but our attention can be drawn to them in various ways, for instance by amplifying them or by switching them on and off while the other harmonics remain continuous (e.g., Bernstein & Oxenham, 2003; Hartmann & Goupell, 2006). The ability to resolve or hear out individual low-numbered harmonics as pure tones was already noted by Hermann von Helmholtz in his classic work, On the Sensations of Tone (Helmholtz, 1885/1954). The higher-numbered harmonics, which do not produce individual peaks of excitation and cannot typically be heard out, are often referred to as being unresolved. The transition between resolved and unresolved harmonics is thought to lie somewhere between the 5th and 10th harmonic, depending on various factors, such as the F0 and the relative amplitudes of the components, as well as on how resolvability is defined (e.g., Bernstein & Oxenham, 2003; Houtsma & Smurzynski, 1990; Moore & Gockel, 2011; Shackleton & Carlyon, 1994). Numerous theories and models have been devised to explain how pitch is extracted from the information present in the auditory periphery (de Cheveigné, 2005). As with pure tones, the theories can be divided into two basic categories: place and temporal theories. The place theories generally propose that the auditory system uses the lower-order, resolved harmonics to calculate the pitch (e.g., Cohen, Grossberg, & Wyse, 1995; Goldstein, 1973; Terhardt, 1974b; Wightman, 1973). This could be achieved by way of a template-matching process, with either hard-wired harmonic templates or templates that develop through repeated exposure to harmonic series, which eventually become associated with the F0. Temporal theories typically involve evaluating the time intervals between auditory-nerve spikes, using a form of autocorrelation or all-interval spike histogram (Cariani & Delgutte, 1996; Licklider, 1951; Meddis & Hewitt, 1991; Meddis & O'Mard, 1997; Schouten, Ritsma, & Cardozo, 1962). This information can be obtained from both resolved and unresolved harmonics.


Pooling these spikes from across the nerve array results in a dominant interval emerging that corresponds to the period of the waveform (i.e., the reciprocal of the F0). A third alternative involves using both place and temporal information. In one version, coincident timing between neurons with harmonically related CFs is postulated to lead to a spatial network of coincidence detectors: a place-based template that emerges through coincident timing information (Shamma & Klein, 2000). In another version, the impulse-response time of the auditory filters, which depends on the CF, is postulated to determine the range of periodicities that a certain tonotopic location can code (de Cheveigné & Pressnitzer, 2006). Recent physiological studies have supported at least the plausibility of place-time mechanisms to code pitch (Cedolin & Delgutte, 2010). Distinguishing between place and temporal (or place-time) models of pitch has proved very difficult. In part, this is because spectral and temporal representations of a signal are mathematically equivalent: any change in the spectral representation will automatically lead to a change in the temporal representation, and vice versa. Psychoacoustic attempts to distinguish between place and temporal mechanisms have focused on the limits imposed by the peripheral physiology in the cochlea and auditory nerve. For instance, the limits of frequency selectivity can be used to test the place theory: if all harmonics are clearly unresolved (and therefore providing no place information) and a pitch is still heard, then pitch cannot depend solely on place information. Similarly, the putative limits of phase locking can be used: if the periodicity of the waveform and the frequencies of all the resolved harmonics are all above the limit of phase locking in the auditory nerve and a pitch is still heard, then temporal information is unlikely to be necessary for pitch perception. A number of studies have shown that pitch perception is possible even when harmonic tone complexes are filtered to remove all the low-numbered, resolved harmonics (Bernstein & Oxenham, 2003; Houtsma & Smurzynski, 1990; Kaernbach & Bering, 2001; Shackleton & Carlyon, 1994). A similar conclusion was reached by studies that used amplitude-modulated broadband noise, which has no spectral peaks in its long-term spectrum (Burns & Viemeister, 1976, 1981). These results suggest that pitch can be extracted from temporal information alone, thereby ruling out theories that consider only place coding. However, the pitch sensation produced by unresolved harmonics or modulated noise is relatively weak compared with the pitch of musical instruments, which produce full harmonic complex tones. The more salient pitch that we normally associate with music is provided by the lower-numbered resolved harmonics. Studies that have investigated the relative contributions of individual harmonics have found that harmonics 3 to 5 (Moore, Glasberg, & Peters, 1985), or frequencies around 600 Hz (Dai, 2000), seem to have the most influence on the pitch of the overall complex. This is where current temporal models also encounter some difficulty: they are able to extract the F0 of a complex tone as well from unresolved harmonics as from resolved harmonics, and therefore they do not predict the large difference in pitch salience and accuracy between low- and high-numbered harmonics that is observed in psychophysical studies (Carlyon, 1998).


In other words, place models do not predict good enough performance with unresolved harmonics, whereas temporal models predict performance that is too good. The apparently qualitative and quantitative difference in the pitch produced by low-numbered and high-numbered harmonics has led to the suggestion that there may be two pitch mechanisms at work, one to code the temporal envelope repetition rate from high-numbered harmonics and one to code the F0 from the individual low-numbered harmonics (Carlyon & Shackleton, 1994), although subsequent work has questioned some of the evidence proposed for the two mechanisms (Gockel, Carlyon, & Plack, 2004; Micheyl & Oxenham, 2003). The fact that low-numbered, resolved harmonics are important suggests that place coding may play a role in everyday pitch. Further evidence comes from a variety of studies. The study mentioned earlier that used tones with low-frequency temporal information transposed into a high-frequency range (Oxenham et al., 2004) studied complex-tone pitch perception by transposing the information from harmonics 3, 4, and 5 of a 100-Hz F0 to high-frequency regions of the cochlea, at roughly 4 kHz, 6 kHz, and 10 kHz. If temporal information were sufficient to elicit a periodicity pitch, then listeners should have been able to hear a pitch corresponding to 100 Hz. In fact, none of the listeners reported hearing a low pitch or was able to match the pitch of the transposed tones to that of the missing fundamental. This suggests that, if temporal information is used, it may need to be presented to the correct place along the cochlea. Another line of evidence has come from revisiting early conclusions that no pitch is heard when all the harmonics are above about 5 kHz (Ritsma, 1962). The initial finding led researchers to suggest that timing information was crucial and that at frequencies above the limits of phase locking, periodicity pitch was not perceived. A recent study revisited this conclusion and found that, in fact, listeners were well able to hear pitches between 1 and 2 kHz, even when all the harmonics were filtered to be above 6 kHz, and were sufficiently resolved to ensure that no temporal envelope cues were available (Oxenham et al., 2011). This outcome leads to an interesting dissociation: tones above 6 kHz on their own do not produce a musically useful pitch; however, those same tones when combined with others in a harmonic series can produce a musical pitch sufficient to convey a melody. The results suggest that the upper limit of musical pitch may not in fact be explained by the upper limit of phase locking: the fact that pitch can be heard even when all tones are above 5 kHz suggests either that temporal information is not necessary for musical pitch or that usable phase locking in the human auditory nerve extends to much higher frequencies than currently believed (Heinz, Colburn, & Carney, 2001; Moore & Sęk, 2009). A further line of evidence for the importance of place information has come from studies that have investigated the relationship between pitch accuracy and auditory filter bandwidths. Moore and Peters (1992) investigated the relationship between auditory filter bandwidths, measured using spectral masking techniques (Glasberg & Moore, 1990), pure-tone frequency discrimination, and complex-tone F0 discrimination in young and elderly people with normal and impaired hearing. People with hearing impairments were tested because they often have auditory filter bandwidths that are broader than normal.


A wide range of results were found: some participants with normal filter bandwidths showed impaired pure-tone and complex-tone pitch discrimination thresholds; others with abnormally wide filters still had relatively normal pure-tone pitch discrimination thresholds. However, none of the participants with broadened auditory filters had normal F0 discrimination thresholds, suggesting that perhaps broader filters resulted in fewer or no resolved harmonics and that resolved harmonics are necessary for accurate F0 discrimination. This question was pursued later by Bernstein and Oxenham (2006a, 2006b), who systematically increased the lowest harmonic present in a harmonic complex tone and measured the point at which F0 discrimination thresholds worsened. In normal-hearing listeners, there is quite an abrupt transition from good to poor pitch discrimination as the lowest harmonic present is increased from the 9th to the 12th (Houtsma & Smurzynski, 1990). Bernstein and Oxenham reasoned that if the transition point is related to frequency selectivity and the resolvability of the harmonics, then the transition point should decrease to lower harmonic numbers as the auditory filters become wider. They tested this in hearing-impaired listeners and found a significant correlation between the transition point and the estimated bandwidth of the auditory filters (Bernstein & Oxenham, 2006b), suggesting that harmonics may need to be resolved in order to elicit a strong musical pitch. Interestingly, even though resolved harmonics may be necessary for accurate pitch perception, they may not be sufficient. Bernstein and Oxenham (2003) increased the number of resolved harmonics available to listeners by presenting alternating harmonics to opposite ears. In this way, the spacing between successive components in each ear was doubled, thereby doubling the number of peripherally resolved harmonics. Listeners were able to hear out about twice as many harmonics in this new condition, but that did not improve their pitch discrimination thresholds for the complex tone. In other words, providing access to harmonics that are not normally resolved does not improve pitch perception abilities. These results are consistent with theories that rely on pitch templates. If harmonics are not normally available to the auditory system, they would be unlikely to be incorporated into templates and so would not be expected to contribute to the pitch percept when presented by artificial means, such as presenting them to alternate ears. Most sounds in our world, including those produced by musical instruments, tend to have more energy at low frequencies than at high; on average, spectral amplitude decreases at a rate of about 1/f, or -6 dB/octave. It therefore makes sense that the auditory system would rely on the lower-numbered harmonics to determine pitch, as these are the ones that are most likely to be audible. Also, resolved harmonics (ones that produce a peak in the excitation pattern and elicit a sinusoidal temporal response) are much less susceptible to the effects of room reverberation than are unresolved harmonics. Pitch discrimination thresholds for unresolved harmonics are relatively good (about 2%) when all the components have the same starting phase (as in a stream of pulses). However, thresholds are much worse when the phase relationships are scrambled, as they would be in a reverberant hall or church, and listeners' discrimination thresholds can be as poor as 10%, more than a musical semitone.
In contrast, the response to resolved harmonics is not materially affected by reverberation: changing the starting phase of a single sinusoid does not affect its waveshape; it still remains a sinusoid, with frequency discrimination thresholds of considerably less than 1%.


A number of physiological and neuroimaging studies have searched for representations of pitch beyond the cochlea (Winter, 2005). Potential correlates of periodicity have been found in single- and multi-unit studies of the cochlear nucleus (Winter, Wiegrebe, & Patterson, 2001), in the inferior colliculus (Langner & Schreiner, 1988), and in auditory cortex (Bendor & Wang, 2005). Human neuroimaging studies have also found correlates of periodicity in the brainstem (Griffiths, Uppenkamp, Johnsrude, Josephs, & Patterson, 2001) as well as in auditory cortical structures (Griffiths, Buchel, Frackowiak, & Patterson, 1998). More recently, Penagos, Melcher, and Oxenham (2004) identified a region in human auditory cortex that seemed sensitive to the degree of pitch salience, as opposed to physical parameters, such as F0 or spectral region. However, these studies are also not without some controversy. For instance, Hall and Plack (2009) failed to find any single region in the human auditory cortex that responded to pitch, independent of other stimulus parameters. Similarly, in a physiological study of the ferret's auditory cortex, Bizley, Walker, Silverman, King, and Schnupp (2009) found interdependent coding of pitch, timbre, and spatial location and did not find any pitch-specific region. In summary, the pitch of single harmonic complex tones is determined primarily by the first 5 to 8 harmonics, which are also those thought to be resolved in the peripheral auditory system. To extract the pitch, the auditory system must somehow combine and synthesize information from these harmonics. Exactly how this occurs in the auditory system remains a matter of ongoing research.
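As a complement to the temporal (autocorrelation) idea sketched earlier, the toy example below illustrates the template-matching logic behind place models: candidate F0s are scored by how closely a set of resolved component frequencies falls on their harmonics. This is only a sketch under simplifying assumptions; among many other things, real models must resolve subharmonic (octave) ambiguities, which are sidestepped here by restricting the search range.

# Toy sketch of harmonic template matching over resolved components.
import numpy as np

components = np.array([660.0, 880.0, 1100.0])      # e.g., harmonics 3-5 of a 220-Hz F0

def template_score(f0, freqs):
    harmonic_numbers = np.maximum(np.round(freqs / f0), 1.0)
    deviations = np.abs(freqs - harmonic_numbers * f0) / freqs
    return -deviations.sum()                        # higher (less negative) = better fit

candidates = np.arange(150.0, 500.0, 1.0)           # range chosen to avoid subharmonic matches
scores = [template_score(f0, components) for f0 in candidates]
print(candidates[int(np.argmax(scores))])           # about 220 Hz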

C. Timbre
The official ANSI definition of timbre is: "That attribute of auditory sensation which enables a listener to judge that two nonidentical sounds, similarly presented and having the same loudness and pitch, are dissimilar" (ANSI, 1994). The standard goes on to note that timbre depends primarily on the frequency spectrum of the sound, but can also depend on the sound pressure and temporal characteristics. In other words, anything that is not pitch or loudness is timbre. As timbre has its own chapter in this volume (Chapter 2), it will not be discussed further here. However, timbre makes an appearance in the next section, where its influence on pitch and loudness judgments is addressed.

D. Sensory Interactions and Cross-Modal Influences


The auditory sensations of loudness, pitch, and timbre are for the most part studied independently. Nevertheless, a sizeable body of evidence suggests that these sensory dimensions are not strictly independent. Furthermore, other sensory modalities, in particular vision, can have sizeable effects on auditory judgments of musical sounds.


Figure 5 Representations of F0 and spectral peak, which primarily affect the sensations of pitch and timbre, respectively. (The schematic spectra combine low and high F0s with low and high spectral peaks; pitch increases with F0, and brightness increases with the spectral peak.)

1. Pitch and Timbre Interactions


Pitch and timbre are the two dimensions most likely to be confused, particularly by people without any musical training. Increasing the F0 of a complex tone results in an increase in pitch, whereas increasing the spectral center of gravity of a tone increases its brightness, one aspect of timbre (Figure 5). In both cases, when asked to describe the change, many listeners would simply say that the sound was "higher." In general, listeners find it hard to ignore changes in timbre when making pitch judgments. Numerous studies have shown that the JND for F0 increases when the two sounds to be compared also vary in spectral content (e.g., Borchert, Micheyl, & Oxenham, 2011; Faulkner, 1985; Moore & Glasberg, 1990). In principle, this could be because the change in spectral shape actually affects pitch or because listeners have difficulty ignoring timbre changes and concentrating solely on pitch. Studies using pitch matching have generally found that harmonic complex tones are best matched with a pure-tone frequency corresponding to the F0, regardless of the spectral content of the complex tone (e.g., Patterson, 1973), which means that the detrimental effects of differing timbre may be related more to a distraction effect than to a genuine change in pitch (Moore & Glasberg, 1990).
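A simple way to quantify the timbral dimension referred to here is the spectral centroid, the amplitude-weighted average frequency of the spectrum; the sketch below (illustrative values only) shows how the centroid can change while F0, and hence pitch, stays fixed, and vice versa.

# Sketch: F0 versus spectral centroid (a common correlate of brightness).
import numpy as np

def spectral_centroid(f0, amplitudes):
    freqs = f0 * np.arange(1, len(amplitudes) + 1)   # harmonic frequencies
    amps = np.asarray(amplitudes, dtype=float)
    return float((freqs * amps).sum() / amps.sum())

dull = [1.0, 0.5, 0.25, 0.12, 0.06]      # amplitudes falling off quickly with harmonic number
bright = [0.2, 0.5, 1.0, 1.0, 0.8]       # relatively more energy in the higher harmonics

print(spectral_centroid(220, dull))      # lower centroid: duller timbre
print(spectral_centroid(220, bright))    # higher centroid: brighter timbre, same F0 (same pitch)
print(spectral_centroid(440, dull))      # doubling F0 with the same pattern raises pitch and centroid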

2. Effects of Pitch or Timbre Changes on the Accuracy of Loudness Judgments


Just as listeners have more difficulty judging pitch in the face of varying timbre, loudness comparisons between two sounds become much more challenging when either the pitch or timbre of the two sounds differs. Examples include the difficulty of making loudness comparisons between two pure tones of different frequency (Gabriel et al., 1997; Oxenham & Buus, 2000), and the difficulty of making loudness comparisons between tones of differing duration, even when they have the same frequency (Florentine, Buus, & Robinson, 1998).

3. Visual Influences on Auditory Attributes


As anyone who has watched a virtuoso musician will know, visual input affects the aesthetic experience of the audience. More direct influences of vision on auditory sensations, and vice versa, have also been reported in recent years. For instance, noise that is presented simultaneously with a light tends to be rated as louder than noise presented without light (Odgaard, Arieh, & Marks, 2004). Interestingly, this effect appears to be sensory in nature, rather than a late-stage decisional effect, or shift in criterion; in contrast, similar effects of noise on the apparent brightness of light (Stein, London, Wilkinson, & Price, 1996) seem to stem from higher-level decisional and criterion-setting mechanisms (Odgaard, Arieh, & Marks, 2003). On the other hand, recent combinations of behavioral and neuroimaging techniques have suggested that the combination of sound with light can result in increased sensitivity to low-level light, which is reflected in changes in activation of sensory cortices (Noesselt et al., 2010). Visual cues can also affect other attributes of sound. For instance, Schutz and colleagues (Schutz & Kubovy, 2009; Schutz & Lipscomb, 2007) have shown that the gestures made in musical performance can affect the perceived duration of a musical sound: a short or staccato gesture by a marimba player led to shorter judged durations of the tone than a long gesture by the player, even though the tone itself was identical. Interestingly, this did not hold for sustained sounds, such as a clarinet, where visual information had much less impact on duration judgments. The difference may relate to the exponential decay of percussive sounds, which have no clearly defined end, allowing the listeners to shift their criterion for the end point to better match the visual information.

III. Perception of Sound Combinations

A. Object Perception and Grouping


When a musical tone, such as a violin note or a sung vowel, is presented, we normally hear a single sound with a single pitch, even though the note actually consists of many different pure tones, each with its own frequency and pitch. This perceptual fusion is partly because all the pure tones begin and end at roughly the same time, and partly because they form a single harmonic series (Darwin, 2005). The importance of onset and offset synchrony can be demonstrated by delaying one of the components relative to all the others. A delay of only a few tens of milliseconds is sufficient for the delayed component to pop out and be heard as a separate object. Similarly, if one component is mistuned compared to the rest of the complex, it will be heard out as a separate object, provided the mistuning is sufficiently large. For low-numbered harmonics, mistuning a harmonic by between 1 and 3% is
sufficient for it to pop out (Moore, Glasberg, & Peters, 1986). Interestingly, a mistuned harmonic that is heard as a separate object can still contribute to the overall pitch of the complex, even when it is mistuned by as much as 8%, well above the threshold for hearing it out as a separate object (Darwin & Ciocca, 1992; Darwin, Hukin, & al-Khatib, 1995; Moore et al., 1985). This is an example of a failure of disjoint allocation: a single component is not allocated exclusively to a single auditory object (Liberman, Isenberg, & Rakerd, 1981; Shinn-Cunningham, Lee, & Oxenham, 2007).

B. Perceiving Multiple Pitches


How many tones can we hear at once? Considering all the different instruments in an orchestra, one might expect the number to be quite high, and a well-trained conductor will in many cases be able to hear a wrong note played by a single instrument within that orchestra. But are we aware of all the pitches being presented at once, and can we count them? Huron (1989) suggested that the number of independent voices we can perceive and count is actually rather low. He used sounds of homogeneous timbre (organ notes) and played participants sections from a piece of polyphonic organ music by J. S. Bach with between one and five voices playing simultaneously. Despite the fact that most of the participants were musically trained, their ability to judge accurately the number of voices present decreased dramatically when the number of voices exceeded three. Using much simpler stimuli, consisting of several simultaneous pure tones, Demany and Ramos (2005) made the interesting discovery that participants could not tell whether a certain tone was present in or absent from a chord, but they noticed if its frequency was changed in the next presentation. In other words, listeners detected a change in the frequency of a tone that was itself undetected. Taken together with the results of Huron (1989), the data suggest that the pitches of many tones can be processed simultaneously, but that listeners may only be consciously aware of a subset of three or four at any one time.

C. The Role of Frequency Selectivity in the Perception of Multiple Tones

1. Roughness


When two pure tones of differing frequency are added, the resulting waveform fluctuates in amplitude at a rate corresponding to the difference of the two frequencies. These amplitude fluctuations, or beats, are illustrated in Figure 6, which shows how the two tones are sometimes in phase, and add constructively (A), and sometimes out of phase, and so cancel (B). At beat rates of less than about 10 Hz, we hear the individual fluctuations, but once the rate increases above about 12 Hz, we are no longer able to follow the individual fluctuations and instead perceive a rough sound (Daniel & Weber, 1997; Terhardt, 1974a).
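The relationship between the beat rate and the frequency difference can be verified numerically. The short sketch below (in Python; the 440- and 444-Hz values and the variable names are illustrative choices, not taken from the text) sums two equal-amplitude sinusoids and shows that the sum is a carrier at the mean frequency multiplied by an envelope that fluctuates at the difference frequency:

    import numpy as np

    fs = 8000                  # sample rate (Hz)
    f1, f2 = 440.0, 444.0      # two pure tones 4 Hz apart (illustrative values)
    t = np.arange(0, 2.0, 1.0 / fs)

    x = np.sin(2 * np.pi * f1 * t) + np.sin(2 * np.pi * f2 * t)

    # Trigonometric identity: sin(a) + sin(b) = 2 sin((a+b)/2) cos((a-b)/2),
    # i.e., a carrier at the mean frequency times a slowly varying envelope.
    carrier = np.sin(2 * np.pi * (f1 + f2) / 2 * t)
    envelope = 2 * np.cos(2 * np.pi * (f1 - f2) / 2 * t)

    print(np.allclose(x, carrier * envelope))   # True: the identity holds
    # Loudness follows the magnitude of the envelope, which peaks twice per
    # cosine period, so the audible beat rate is |f1 - f2| = 4 Hz here.
    print("beat rate =", abs(f1 - f2), "Hz")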


Figure 6 Illustration of the beats created by the summation of two sinusoids with slightly different frequencies. At some points in time, the two waveforms are in phase and so add constructively (A); at other points in time, the two waveforms are in antiphase and their waveforms cancel (B). The resulting waveform fluctuates at a rate corresponding to the difference of the two frequencies.


According to studies of roughness, the percept is maximal at rates of around 70 Hz and then decreases. The decrease in perceived roughness with increasing rate is in part because the auditory system becomes less sensitive to modulation above about 100 to 150 Hz, and in part due to the effects of auditory filtering (Kohlrausch, Fassel, & Dau, 2000): If the two tones do not fall within the same auditory filter, the beating effect is reduced because the tones do not interact to form the complex waveform; instead (as with resolved harmonics) each tone is represented separately in the auditory periphery. Therefore, the perception of beats depends to a large extent on peripheral interactions in the ear. (Binaural beats also occur between sounds presented to opposite ears, but they are much less salient and are heard over a much smaller range of frequency differences; see Licklider, Webster, & Hedlun, 1950.) The percept of roughness that results from beats has been used to explain a number of musical phenomena. First, chords played in the lower registers typically sound "muddy," and music theory calls for notes within a chord to be spaced further apart in low registers than in higher ones. This may be in part because the auditory filters are relatively wider at low frequencies (below about 500 Hz), leading to stronger peripheral interactions, and hence greater roughness, for tones that are spaced a constant musical interval apart. Second, it has been hypothesized that roughness underlies in part the attribute of dissonance that is used to describe unpleasant combinations of notes. The relationship between dissonance and beating is considered further in Section III.D.
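The effect of register can be illustrated roughly with the equivalent rectangular bandwidth (ERB) estimate of the auditory filter from Glasberg and Moore (1990). The sketch below compares the frequency separation of a minor third with the ERB around the lower tone at two registers; the choice of C3 and C5 as lower notes is an illustrative assumption, not a case discussed in the text:

    def erb(f_hz):
        # Equivalent rectangular bandwidth (Hz) of the auditory filter
        # centered at f_hz, using the Glasberg and Moore (1990) estimate.
        return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

    def minor_third_separation(f_low):
        # Frequency separation (Hz) between two tones a minor third apart.
        return f_low * (2 ** (3 / 12) - 1)

    for f in (130.8, 523.3):   # approximate C3 and C5 (illustrative)
        sep = minor_third_separation(f)
        print(f"lower tone {f:6.1f} Hz: interval spans {sep:5.1f} Hz, "
              f"ERB = {erb(f):5.1f} Hz, separation/ERB = {sep / erb(f):.2f}")

In the low register the separation is less than one ERB, so the two tones interact strongly within a common filter and beat audibly, whereas in the higher register the separation exceeds one ERB and the interaction is much weaker.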


2. Pitch Perception of Multiple Sounds


Despite the important role of tone combinations or chords in music, relatively few psychoacoustic studies have examined their perception. Beerends and Houtsma (1989) used complex tones consisting of just two consecutive harmonics each. Although the pitch of these two-component complexes is relatively weak, with practice, listeners can learn to identify the F0 of such complexes accurately. Beerends and Houtsma found that listeners were able to identify the pitches of the two complex tones, even if the harmonics from one sound were presented to different ears. The only exception was when all the components were presented to one ear and none of the four components was deemed to be resolved. In that case, listeners were not able to identify either pitch accurately. Carlyon (1996) used harmonic tone complexes with more harmonics and filtered them so that they had completely overlapping spectral envelopes. He found that when both complexes were composed of resolved harmonics, listeners were able to hear out the pitch of one complex in the presence of the other. However, the surprising finding was that when both complexes comprised only unresolved harmonics, listeners did not hear a pitch at all, but described the percept as an unmusical "crackle." To avoid ambiguity, Carlyon (1996) used harmonics that were either highly resolved or highly unresolved. Because of this, it remained unclear whether it is the resolvability of the harmonics before or after the two sounds are mixed that determines whether each tone elicits a clear pitch. Micheyl and colleagues addressed this issue, using a variety of combinations of spectral region and F0 to vary the relative resolvability of the components (Micheyl, Bernstein, & Oxenham, 2006; Micheyl, Keebler, & Oxenham, 2010). By comparing the results to simulations of auditory filtering, they found that good pitch discrimination was only possible when at least two of the harmonics from the target sound were deemed resolved after being mixed with the other sound (Micheyl et al., 2010). The results are consistent with place theories of pitch that rely on resolved harmonics; however, it may be possible to adapt timing-based models of pitch to explain the phenomena similarly (e.g., Bernstein & Oxenham, 2005).

D. Consonance and Dissonance


The question of how certain combinations of tones sound when played together is central to many aspects of music theory. Combinations of two tones that form certain musical intervals, such as the octave and the fifth, are typically deemed to sound pleasant or consonant, whereas others, such as the augmented fourth (tritone), are often considered unpleasant or dissonant. These types of percepts, involving tones presented in isolation from a musical context, have been termed sensory consonance or dissonance. The term musical consonance (Terhardt, 1976, 1984) subsumes sensory factors but also includes many other factors that contribute to whether a sound combination is judged as consonant or dissonant, including the context (what sounds preceded it), the style of music (e.g., jazz or classical), and presumably also the personal taste and musical history of the individual listener.


There has been a long-standing search for acoustic and physiological correlates of consonance and dissonance, going back to the observations of Pythagoras that strings whose lengths had a small-number ratio relationship (e.g., 2:1 or 3:2) sounded pleasant together. Helmholtz (1885/1954) suggested that consonance may be related to the absence of beats (perceived as roughness) in musical sounds. Plomp and Levelt (1965) developed the idea further by showing that the ranking by consonance of musical intervals within an octave was well predicted by the number of component pairs within the two complex tones that fell within the same auditory filters and therefore caused audible beats (see also Kameoka & Kuriyagawa, 1969a, 1969b). When two complex tones form a consonant interval, such as an octave or a fifth, the harmonics are either exactly coincident, and so do not produce beats, or are spaced so far apart as to not produce strong beats. In contrast, when the tones form a dissonant interval, such as a minor second, none of the components are coincident, but many are close enough to produce beats. An alternative theory of consonance is based on the harmonicity of the sound combination, or how closely it resembles a single harmonic series. Consider, for instance, two complex tones that form the interval of a perfect fifth, with F0s of 440 and 660 Hz. All the components from both tones are multiples of a single F0 (220 Hz) and so, according to the harmonicity account of consonance, should sound consonant. In contrast, the harmonics of two tones that form an augmented fourth, with F0s of 440 Hz and 622 Hz, do not approximate any single harmonic series within the range of audible pitches and so should sound dissonant, as found empirically. The harmonicity theory of consonance can be implemented by using a spectral template model (Terhardt, 1974b) or by using temporal information, derived for instance from spikes in the auditory nerve (Tramo, Cariani, Delgutte, & Braida, 2001). Because the beating and harmonicity theories of consonance and dissonance produce very similar predictions, it has been difficult to distinguish between them experimentally. A recent study took a step toward this goal by examining individual differences in a large group (more than 200) of participants (McDermott, Lehr, & Oxenham, 2010). First, listeners were asked to provide preference ratings for diagnostic stimuli that varied in beating but not harmonicity, or vice versa. Next, listeners were asked to provide preference ratings for various musical sound combinations, including dyads (two-note chords) and triads (three-note chords), using natural and artificial musical instruments and voices. When the ratings in the two types of tasks were compared, the correlations between the ratings for the harmonicity diagnostic tests and the musical sounds were significant, but the correlations between the ratings for the beating diagnostic tests and the musical sounds were not. Interestingly, the number of years of formal musical training also correlated with both the harmonicity and musical preference ratings, but not with the beating ratings. Overall, the results suggested that harmonicity, rather than lack of beating, underlies listeners' consonance preferences and that musical training may amplify the preference for harmonic relationships. Developmental studies have shown that infants as young as 3 or 4 months show a preference for consonant over dissonant musical intervals (Trainor & Heinmiller,
1998; Zentner & Kagan, 1996, 1998). However, it is not yet known whether infants are responding to beats, to inharmonicity, or to both. It would be interesting to discover whether the adult preferences for harmonicity revealed by McDermott et al. (2010) are shared by infants, or whether infants initially base their preferences on acoustic beats.
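The harmonicity account described above can be illustrated with a small numerical sketch. Using the same F0s as in the earlier example (440 and 660 Hz for the perfect fifth; 440 and 622 Hz for the augmented fourth), the code below measures how far the partials of each pair deviate from the nearest single harmonic series; the number of harmonics, the 100-450 Hz search range, and the function names are illustrative assumptions, not values from the studies cited here:

    def max_deviation_from_series(f0, partials):
        # Largest relative deviation (in percent) of any partial from the
        # nearest harmonic of the candidate fundamental f0.
        return max(abs(p - f0 * round(p / f0)) / p * 100 for p in partials)

    def partials(f0_a, f0_b, n=8):
        # First n harmonics of each of the two complex tones.
        return [k * f for f in (f0_a, f0_b) for k in range(1, n + 1)]

    fifth = partials(440.0, 660.0)     # perfect fifth
    tritone = partials(440.0, 622.0)   # augmented fourth

    # The fifth's partials coincide exactly with harmonics of 220 Hz...
    print(max_deviation_from_series(220.0, fifth))        # 0.0
    # ...whereas no fundamental between 100 and 450 Hz accommodates the
    # tritone's partials without mistunings of a few percent.
    best = min((max_deviation_from_series(f / 10, tritone), f / 10)
               for f in range(1000, 4500))
    print(best)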

IV. Conclusions and Outlook

Although the perception of musical tones should be considered primarily in musical contexts, much about the interactions between acoustics, auditory physiology, and perception can be learned through psychoacoustic experiments using relatively simple stimuli and procedures. Recent findings using psychoacoustics, alone or in combination with neurophysiology and neuroimaging, have extended our knowledge of how pitch, timbre, and loudness are perceived and represented neurally, both for tones in isolation and in combination. However, much still remains to be discovered. Important trends include the use of more naturalistic stimuli in experiments and for testing computational models of perception, as well as the simultaneous combination of perceptual and neural measures when attempting to elucidate the underlying neural mechanisms of auditory perception. Using the building blocks provided by the psychoacoustics of individual and simultaneous musical tones, it is possible to proceed to answering much more sophisticated questions regarding the perception of music as it unfolds over time. These and other issues are tackled in the remaining chapters of this volume.

Acknowledgments
Emily Allen, Christophe Micheyl, and John Oxenham provided helpful comments on an earlier version of this chapter. The work from the author's laboratory is supported by funding from the National Institutes of Health (Grants R01 DC 05216 and R01 DC 07657).

References
American National Standards Institute. (1994). Acoustical terminology. ANSI S1.1-1994. New York, NY: Author.
Arieh, Y., & Marks, L. E. (2003a). Recalibrating the auditory system: A speed-accuracy analysis of intensity perception. Journal of Experimental Psychology: Human Perception and Performance, 29, 523–536.
Arieh, Y., & Marks, L. E. (2003b). Time course of loudness recalibration: Implications for loudness enhancement. Journal of the Acoustical Society of America, 114, 1550–1556.
Attneave, F., & Olson, R. K. (1971). Pitch as a medium: A new approach to psychophysical scaling. American Journal of Psychology, 84, 147–166.


Beerends, J. G., & Houtsma, A. J. M. (1989). Pitch identification of simultaneous diotic and dichotic two-tone complexes. Journal of the Acoustical Society of America, 85, 813–819.
Bendor, D., & Wang, X. (2005). The neuronal representation of pitch in primate auditory cortex. Nature, 436, 1161–1165.
Bernstein, J. G., & Oxenham, A. J. (2003). Pitch discrimination of diotic and dichotic tone complexes: Harmonic resolvability or harmonic number? Journal of the Acoustical Society of America, 113, 3323–3334.
Bernstein, J. G., & Oxenham, A. J. (2005). An autocorrelation model with place dependence to account for the effect of harmonic number on fundamental frequency discrimination. Journal of the Acoustical Society of America, 117, 3816–3831.
Bernstein, J. G., & Oxenham, A. J. (2006a). The relationship between frequency selectivity and pitch discrimination: Effects of stimulus level. Journal of the Acoustical Society of America, 120, 3916–3928.
Bernstein, J. G., & Oxenham, A. J. (2006b). The relationship between frequency selectivity and pitch discrimination: Sensorineural hearing loss. Journal of the Acoustical Society of America, 120, 3929–3945.
Bizley, J. K., Walker, K. M., Silverman, B. W., King, A. J., & Schnupp, J. W. (2009). Interdependent encoding of pitch, timbre, and spatial location in auditory cortex. Journal of Neuroscience, 29, 2064–2075.
Borchert, E. M., Micheyl, C., & Oxenham, A. J. (2011). Perceptual grouping affects pitch judgments across time and frequency. Journal of Experimental Psychology: Human Perception and Performance, 37, 257–269.
Burns, E. M., & Viemeister, N. F. (1976). Nonspectral pitch. Journal of the Acoustical Society of America, 60, 863–869.
Burns, E. M., & Viemeister, N. F. (1981). Played again SAM: Further observations on the pitch of amplitude-modulated noise. Journal of the Acoustical Society of America, 70, 1655–1660.
Buus, S., Muesch, H., & Florentine, M. (1998). On loudness at threshold. Journal of the Acoustical Society of America, 104, 399–410.
Cariani, P. A., & Delgutte, B. (1996). Neural correlates of the pitch of complex tones. I. Pitch and pitch salience. Journal of Neurophysiology, 76, 1698–1716.
Carlyon, R. P. (1996). Encoding the fundamental frequency of a complex tone in the presence of a spectrally overlapping masker. Journal of the Acoustical Society of America, 99, 517–524.
Carlyon, R. P. (1998). Comments on "A unitary model of pitch perception" [Journal of the Acoustical Society of America, 102, 1811–1820 (1997)]. Journal of the Acoustical Society of America, 104, 1118–1121.
Carlyon, R. P., & Shackleton, T. M. (1994). Comparing the fundamental frequencies of resolved and unresolved harmonics: Evidence for two pitch mechanisms? Journal of the Acoustical Society of America, 95, 3541–3554.
Cedolin, L., & Delgutte, B. (2010). Spatiotemporal representation of the pitch of harmonic complex tones in the auditory nerve. Journal of Neuroscience, 30, 12712–12724.
Chalupper, J., & Fastl, H. (2002). Dynamic loudness model (DLM) for normal and hearing-impaired listeners. Acta Acustica united with Acustica, 88, 378–386.
Chen, Z., Hu, G., Glasberg, B. R., & Moore, B. C. (2011). A new method of calculating auditory excitation patterns and loudness for steady sounds. Hearing Research, 282(1–2), 204–215.


Cohen, M. A., Grossberg, S., & Wyse, L. L. (1995). A spectral network model of pitch perception. Journal of the Acoustical Society of America, 98, 862–879.
Dai, H. (2000). On the relative influence of individual harmonics on pitch judgment. Journal of the Acoustical Society of America, 107, 953–959.
Daniel, P., & Weber, R. (1997). Psychoacoustical roughness: Implementation of an optimized model. Acustica, 83, 113–123.
Darwin, C. J. (2005). Pitch and auditory grouping. In C. J. Plack, A. J. Oxenham, R. Fay, & A. N. Popper (Eds.), Pitch: Neural coding and perception (pp. 278–305). New York, NY: Springer Verlag.
Darwin, C. J., & Ciocca, V. (1992). Grouping in pitch perception: Effects of onset asynchrony and ear of presentation of a mistuned component. Journal of the Acoustical Society of America, 91, 3381–3390.
Darwin, C. J., Hukin, R. W., & al-Khatib, B. Y. (1995). Grouping in pitch perception: Evidence for sequential constraints. Journal of the Acoustical Society of America, 98, 880–885.
de Boer, E. (1956). On the residue in hearing (Unpublished doctoral dissertation). The Netherlands: University of Amsterdam.
de Cheveigné, A. (2005). Pitch perception models. In C. J. Plack, A. J. Oxenham, A. N. Popper, & R. Fay (Eds.), Pitch: Neural coding and perception (pp. 169–233). New York, NY: Springer Verlag.
de Cheveigné, A., & Pressnitzer, D. (2006). The case of the missing delay lines: Synthetic delays obtained by cross-channel phase interaction. Journal of the Acoustical Society of America, 119, 3908–3918.
Demany, L., & Ramos, C. (2005). On the binding of successive sounds: perceiving shifts in nonperceived pitches. Journal of the Acoustical Society of America, 117, 833–841.
Durlach, N. I., & Braida, L. D. (1969). Intensity perception. I. Preliminary theory of intensity resolution. Journal of the Acoustical Society of America, 46, 372–383.
Epstein, M., & Florentine, M. (2005). A test of the equal-loudness-ratio hypothesis using cross-modality matching functions. Journal of the Acoustical Society of America, 118, 907–913.
Faulkner, A. (1985). Pitch discrimination of harmonic complex signals: Residue pitch or multiple component discriminations. Journal of the Acoustical Society of America, 78, 1993–2004.
Fechner, G. T. (1860). Elemente der Psychophysik (Vol. 1). Leipzig, Germany: Breitkopf und Härtel.
Fletcher, H., & Munson, W. A. (1933). Loudness, its definition, measurement and calculation. Journal of the Acoustical Society of America, 5, 82–108.
Fletcher, H., & Munson, W. A. (1937). Relation between loudness and masking. Journal of the Acoustical Society of America, 9, 1–10.
Florentine, M., Buus, S., & Robinson, M. (1998). Temporal integration of loudness under partial masking. Journal of the Acoustical Society of America, 104, 999–1007.
Gabriel, B., Kollmeier, B., & Mellert, V. (1997). Influence of individual listener, measurement room and choice of test-tone levels on the shape of equal-loudness level contours. Acustica, 83, 670–683.
Galambos, R., Bauer, J., Picton, T., Squires, K., & Squires, N. (1972). Loudness enhancement following contralateral stimulation. Journal of the Acoustical Society of America, 52(4), 1127–1130.


Glasberg, B. R., & Moore, B. C. J. (1990). Derivation of auditory filter shapes from notched-noise data. Hearing Research, 47, 103–138.
Glasberg, B. R., & Moore, B. C. J. (2002). A model of loudness applicable to time-varying sounds. Journal of the Audio Engineering Society, 50, 331–341.
Gockel, H., Carlyon, R. P., & Plack, C. J. (2004). Across-frequency interference effects in fundamental frequency discrimination: Questioning evidence for two pitch mechanisms. Journal of the Acoustical Society of America, 116, 1092–1104.
Goldstein, J. L. (1973). An optimum processor theory for the central formation of the pitch of complex tones. Journal of the Acoustical Society of America, 54, 1496–1516.
Griffiths, T. D., Buchel, C., Frackowiak, R. S., & Patterson, R. D. (1998). Analysis of temporal structure in sound by the human brain. Nature Neuroscience, 1, 422–427.
Griffiths, T. D., Uppenkamp, S., Johnsrude, I., Josephs, O., & Patterson, R. D. (2001). Encoding of the temporal regularity of sound in the human brainstem. Nature Neuroscience, 4, 633–637.
Hall, D. A., & Plack, C. J. (2009). Pitch processing sites in the human auditory brain. Cerebral Cortex, 19, 576–585.
Hartmann, W. M., & Goupell, M. J. (2006). Enhancing and unmasking the harmonics of a complex tone. Journal of the Acoustical Society of America, 120, 2142–2157.
Heinz, M. G., Colburn, H. S., & Carney, L. H. (2001). Evaluating auditory performance limits: I. One-parameter discrimination using a computational model for the auditory nerve. Neural Computation, 13, 2273–2316.
Hellman, R. P. (1976). Growth of loudness at 1000 and 3000 Hz. Journal of the Acoustical Society of America, 60, 672–679.
Hellman, R. P., & Zwislocki, J. (1964). Loudness function of a 1000-cps tone in the presence of a masking noise. Journal of the Acoustical Society of America, 36, 1618–1627.
Helmholtz, H. L. F. (1885/1954). On the sensations of tone (A. J. Ellis, Trans.). New York, NY: Dover.
Henning, G. B. (1966). Frequency discrimination of random amplitude tones. Journal of the Acoustical Society of America, 39, 336–339.
Houtsma, A. J. M., & Smurzynski, J. (1990). Pitch identification and discrimination for complex tones with many harmonics. Journal of the Acoustical Society of America, 87, 304–310.
Huron, D. (1989). Voice denumerability in polyphonic music of homogenous timbres. Music Perception, 6, 361–382.
Jesteadt, W., Wier, C. C., & Green, D. M. (1977). Intensity discrimination as a function of frequency and sensation level. Journal of the Acoustical Society of America, 61, 169–177.
Kaernbach, C., & Bering, C. (2001). Exploring the temporal mechanism involved in the pitch of unresolved harmonics. Journal of the Acoustical Society of America, 110, 1039–1048.
Kameoka, A., & Kuriyagawa, M. (1969a). Consonance theory part I: Consonance of dyads. Journal of the Acoustical Society of America, 45, 1451–1459.
Kameoka, A., & Kuriyagawa, M. (1969b). Consonance theory part II: Consonance of complex tones and its calculation method. Journal of the Acoustical Society of America, 45, 1460–1469.
Keuss, P. J., & van der Molen, M. W. (1982). Positive and negative effects of stimulus intensity in auditory reaction tasks: Further studies on immediate arousal. Acta Psychologica, 52, 61–72.


Kohfeld, D. L. (1971). Simple reaction time as a function of stimulus intensity in decibels of light and sound. Journal of Experimental Psychology, 88, 251–257.
Kohlrausch, A., Fassel, R., & Dau, T. (2000). The influence of carrier level and frequency on modulation and beat-detection thresholds for sinusoidal carriers. Journal of the Acoustical Society of America, 108, 723–734.
Langner, G., & Schreiner, C. E. (1988). Periodicity coding in the inferior colliculus of the cat. I. Neuronal mechanisms. Journal of Neurophysiology, 60, 1799–1822.
Liberman, A. M., Isenberg, D., & Rakerd, B. (1981). Duplex perception of cues for stop consonants: Evidence for a phonetic mode. Perception & Psychophysics, 30, 133–143.
Licklider, J. C., Webster, J. C., & Hedlun, J. M. (1950). On the frequency limits of binaural beats. Journal of the Acoustical Society of America, 22, 468–473.
Licklider, J. C. R. (1951). A duplex theory of pitch perception. Experientia, 7, 128–133.
Loeb, G. E., White, M. W., & Merzenich, M. M. (1983). Spatial cross correlation: A proposed mechanism for acoustic pitch perception. Biological Cybernetics, 47, 149–163.
Luce, R. D., & Green, D. M. (1972). A neural timing theory for response times and the psychophysics of intensity. Psychological Review, 79, 14–57.
Mapes-Riordan, D., & Yost, W. A. (1999). Loudness recalibration as a function of level. Journal of the Acoustical Society of America, 106, 3506–3511.
Marks, L. E. (1994). Recalibrating the auditory system: The perception of loudness. Journal of Experimental Psychology: Human Perception and Performance, 20, 382–396.
Mauermann, M., Long, G. R., & Kollmeier, B. (2004). Fine structure of hearing threshold and loudness perception. Journal of the Acoustical Society of America, 116, 1066–1080.
McDermott, J. H., Lehr, A. J., & Oxenham, A. J. (2010). Individual differences reveal the basis of consonance. Current Biology, 20, 1035–1041.
Meddis, R., & Hewitt, M. (1991). Virtual pitch and phase sensitivity of a computer model of the auditory periphery. I: Pitch identification. Journal of the Acoustical Society of America, 89, 2866–2882.
Meddis, R., & O'Mard, L. (1997). A unitary model of pitch perception. Journal of the Acoustical Society of America, 102, 1811–1820.
Micheyl, C., Bernstein, J. G., & Oxenham, A. J. (2006). Detection and F0 discrimination of harmonic complex tones in the presence of competing tones or noise. Journal of the Acoustical Society of America, 120, 1493–1505.
Micheyl, C., Delhommeau, K., Perrot, X., & Oxenham, A. J. (2006). Influence of musical and psychoacoustical training on pitch discrimination. Hearing Research, 219, 36–47.
Micheyl, C., Keebler, M. V., & Oxenham, A. J. (2010). Pitch perception for mixtures of spectrally overlapping harmonic complex tones. Journal of the Acoustical Society of America, 128, 257–269.
Micheyl, C., & Oxenham, A. J. (2003). Further tests of the "two pitch mechanisms" hypothesis. Journal of the Acoustical Society of America, 113, 2225.
Miller, G. A. (1956). The magic number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review, 63, 81–96.
Moore, B. C. J. (1973). Frequency difference limens for short-duration tones. Journal of the Acoustical Society of America, 54, 610–619.
Moore, B. C. J., & Glasberg, B. R. (1990). Frequency discrimination of complex tones with overlapping and non-overlapping harmonics. Journal of the Acoustical Society of America, 87, 2163–2177.


Moore, B. C. J., & Glasberg, B. R. (1996). A revision of Zwicker's loudness model. Acustica, 82, 335–345.
Moore, B. C. J., & Glasberg, B. R. (1997). A model of loudness perception applied to cochlear hearing loss. Auditory Neuroscience, 3, 289–311.
Moore, B. C. J., Glasberg, B. R., & Baer, T. (1997). A model for the prediction of thresholds, loudness, and partial loudness. Journal of the Audio Engineering Society, 45, 224–240.
Moore, B. C. J., Glasberg, B. R., & Peters, R. W. (1985). Relative dominance of individual partials in determining the pitch of complex tones. Journal of the Acoustical Society of America, 77, 1853–1860.
Moore, B. C. J., Glasberg, B. R., & Peters, R. W. (1986). Thresholds for hearing mistuned partials as separate tones in harmonic complexes. Journal of the Acoustical Society of America, 80, 479–483.
Moore, B. C. J., Glasberg, B. R., & Vickers, D. A. (1999). Further evaluation of a model of loudness perception applied to cochlear hearing loss. Journal of the Acoustical Society of America, 106, 898–907.
Moore, B. C. J., & Gockel, H. E. (2011). Resolvability of components in complex tones and implications for theories of pitch perception. Hearing Research, 276, 88–97.
Moore, B. C. J., & Peters, R. W. (1992). Pitch discrimination and phase sensitivity in young and elderly subjects and its relationship to frequency selectivity. Journal of the Acoustical Society of America, 91, 2881–2893.
Moore, B. C. J., & Sęk, A. (2009). Sensitivity of the human auditory system to temporal fine structure at high frequencies. Journal of the Acoustical Society of America, 125, 3186–3193.
Noesselt, T., Tyll, S., Boehler, C. N., Budinger, E., Heinze, H. J., & Driver, J. (2010). Sound-induced enhancement of low-intensity vision: Multisensory influences on human sensory-specific cortices and thalamic bodies relate to perceptual enhancement of visual detection sensitivity. Journal of Neuroscience, 30, 13609–13623.
Oberfeld, D. (2007). Loudness changes induced by a proximal sound: Loudness enhancement, loudness recalibration, or both? Journal of the Acoustical Society of America, 121, 2137–2148.
Odgaard, E. C., Arieh, Y., & Marks, L. E. (2003). Cross-modal enhancement of perceived brightness: Sensory interaction versus response bias. Perception & Psychophysics, 65, 123–132.
Odgaard, E. C., Arieh, Y., & Marks, L. E. (2004). Brighter noise: Sensory enhancement of perceived loudness by concurrent visual stimulation. Cognitive, Affective, & Behavioral Neuroscience, 4, 127–132.
Oxenham, A. J., Bernstein, J. G. W., & Penagos, H. (2004). Correct tonotopic representation is necessary for complex pitch perception. Proceedings of the National Academy of Sciences USA, 101, 1421–1425.
Oxenham, A. J., & Buus, S. (2000). Level discrimination of sinusoids as a function of duration and level for fixed-level, roving-level, and across-frequency conditions. Journal of the Acoustical Society of America, 107, 1605–1614.
Oxenham, A. J., Micheyl, C., Keebler, M. V., Loper, A., & Santurette, S. (2011). Pitch perception beyond the traditional existence region of pitch. Proceedings of the National Academy of Sciences USA, 108, 7629–7634.
Palmer, A. R., & Russell, I. J. (1986). Phase-locking in the cochlear nerve of the guinea-pig and its relation to the receptor potential of inner hair-cells. Hearing Research, 24, 1–15.


Patterson, R. D. (1973). The effects of relative phase and the number of components on residue pitch. Journal of the Acoustical Society of America, 53, 1565–1572.
Penagos, H., Melcher, J. R., & Oxenham, A. J. (2004). A neural representation of pitch salience in non-primary human auditory cortex revealed with fMRI. Journal of Neuroscience, 24, 6810–6815.
Plack, C. J. (1996). Loudness enhancement and intensity discrimination under forward and backward masking. Journal of the Acoustical Society of America, 100, 1024–1030.
Plack, C. J., Oxenham, A. J., Popper, A. N., & Fay, R. (Eds.). (2005). Pitch: Neural coding and perception. New York, NY: Springer Verlag.
Plomp, R., & Levelt, W. J. M. (1965). Tonal consonance and critical bandwidth. Journal of the Acoustical Society of America, 38, 548–560.
Poulton, E. C. (1977). Quantitative subjective assessments are almost always biased, sometimes completely misleading. British Journal of Psychology, 68, 409–425.
Poulton, E. C. (1979). Models for the biases in judging sensory magnitude. Psychological Bulletin, 86, 777–803.
Pressnitzer, D., Patterson, R. D., & Krumbholz, K. (2001). The lower limit of melodic pitch. Journal of the Acoustical Society of America, 109, 2074–2084.
Relkin, E. M., & Doucet, J. R. (1997). Is loudness simply proportional to the auditory nerve spike count? Journal of the Acoustical Society of America, 101, 2735–2741.
Ritsma, R. J. (1962). Existence region of the tonal residue. I. Journal of the Acoustical Society of America, 34, 1224–1229.
Robinson, D. W., & Dadson, R. S. (1956). A re-determination of the equal-loudness relations for pure tones. British Journal of Applied Physics, 7, 166–181.
Rose, J. E., Brugge, J. F., Anderson, D. J., & Hind, J. E. (1967). Phase-locked response to low-frequency tones in single auditory nerve fibers of the squirrel monkey. Journal of Neurophysiology, 30, 769–793.
Scharf, B. (1964). Partial masking. Acustica, 14, 16–23.
Scharf, B., Buus, S., & Nieder, B. (2002). Loudness enhancement: Induced loudness reduction in disguise? (L). Journal of the Acoustical Society of America, 112, 807–810.
Schouten, J. F. (1940). The residue and the mechanism of hearing. Proceedings of the Koninklijke Nederlandse Academie van Wetenschappen, 43, 991–999.
Schouten, J. F., Ritsma, R. J., & Cardozo, B. L. (1962). Pitch of the residue. Journal of the Acoustical Society of America, 34, 1418–1424.
Schutz, M., & Kubovy, M. (2009). Causality and cross-modal integration. Journal of Experimental Psychology: Human Perception and Performance, 35, 1791–1810.
Schutz, M., & Lipscomb, S. (2007). Hearing gestures, seeing music: Vision influences perceived tone duration. Perception, 36, 888–897.
Seebeck, A. (1841). Beobachtungen über einige Bedingungen der Entstehung von Tönen. Annals of Physical Chemistry, 53, 417–436.
Shackleton, T. M., & Carlyon, R. P. (1994). The role of resolved and unresolved harmonics in pitch perception and frequency modulation discrimination. Journal of the Acoustical Society of America, 95, 3529–3540.
Shamma, S., & Klein, D. (2000). The case of the missing pitch templates: How harmonic templates emerge in the early auditory system. Journal of the Acoustical Society of America, 107, 2631–2644.
Shinn-Cunningham, B. G., Lee, A. K., & Oxenham, A. J. (2007). A sound element gets lost in perceptual competition. Proceedings of the National Academy of Sciences USA, 104, 12223–12227.


Shofner, W. P. (2005). Comparative aspects of pitch perception. In C. J. Plack, A. J. Oxenham, R. Fay, & A. N. Popper (Eds.), Pitch: Neural coding and perception (pp. 56–98). New York, NY: Springer Verlag.
Stein, B. E., London, N., Wilkinson, L. K., & Price, D. D. (1996). Enhancement of perceived visual intensity by auditory stimuli: A psychophysical analysis. Journal of Cognitive Neuroscience, 8, 497–506.
Stevens, S. S. (1957). On the psychophysical law. Psychological Review, 64, 153–181.
Suzuki, Y., & Takeshima, H. (2004). Equal-loudness-level contours for pure tones. Journal of the Acoustical Society of America, 116, 918–933.
Terhardt, E. (1974a). On the perception of periodic sound fluctuations (roughness). Acustica, 30, 201–213.
Terhardt, E. (1974b). Pitch, consonance, and harmony. Journal of the Acoustical Society of America, 55, 1061–1069.
Terhardt, E. (1976). Psychoakustisch begründetes Konzept der musikalischen Konsonanz. Acustica, 36, 121–137.
Terhardt, E. (1984). The concept of musical consonance, a link between music and psychoacoustics. Music Perception, 1, 276–295.
Trainor, L. J., & Heinmiller, B. M. (1998). The development of evaluative responses to music: Infants prefer to listen to consonance over dissonance. Infant Behavior and Development, 21, 77–88.
Tramo, M. J., Cariani, P. A., Delgutte, B., & Braida, L. D. (2001). Neurobiological foundations for the theory of harmony in western tonal music. Annals of the New York Academy of Sciences, 930, 92–116.
van de Par, S., & Kohlrausch, A. (1997). A new approach to comparing binaural masking level differences at low and high frequencies. Journal of the Acoustical Society of America, 101, 1671–1680.
Verschuure, J., & van Meeteren, A. A. (1975). The effect of intensity on pitch. Acustica, 32, 33–44.
Viemeister, N. F. (1983). Auditory intensity discrimination at high frequencies in the presence of noise. Science, 221, 1206–1208.
Viemeister, N. F., & Bacon, S. P. (1988). Intensity discrimination, increment detection, and magnitude estimation for 1-kHz tones. Journal of the Acoustical Society of America, 84, 172–178.
Wallace, M. N., Rutkowski, R. G., Shackleton, T. M., & Palmer, A. R. (2000). Phase-locked responses to pure tones in guinea pig auditory cortex. Neuroreport, 11, 3989–3993.
Warren, R. M. (1970). Elimination of biases in loudness judgements for tones. Journal of the Acoustical Society of America, 48, 1397–1403.
Wightman, F. L. (1973). The pattern-transformation model of pitch. Journal of the Acoustical Society of America, 54, 407–416.
Winckel, F. W. (1962). Optimum acoustic criteria of concert halls for the performance of classical music. Journal of the Acoustical Society of America, 34, 81–86.
Winter, I. M. (2005). The neurophysiology of pitch. In C. J. Plack, A. J. Oxenham, R. Fay, & A. N. Popper (Eds.), Pitch: Neural coding and perception (pp. 99–146). New York, NY: Springer Verlag.
Winter, I. M., Wiegrebe, L., & Patterson, R. D. (2001). The temporal representation of the delay of iterated rippled noise in the ventral cochlear nucleus of the guinea-pig. Journal of Physiology, 537, 553–566.
Zentner, M. R., & Kagan, J. (1996). Perception of music by infants. Nature, 383, 29.


Zentner, M. R., & Kagan, J. (1998). Infants' perception of consonance and dissonance in music. Infant Behavior and Development, 21, 483–492.
Zwicker, E. (1960). Ein Verfahren zur Berechnung der Lautstärke. Acustica, 10, 304–308.
Zwicker, E. (1963). Über die Lautheit von ungedrosselten und gedrosselten Schallen. Acustica, 13, 194–211.
Zwicker, E., Fastl, H., & Dallmayr, C. (1984). BASIC-Program for calculating the loudness of sounds from their 1/3-oct. band spectra according to ISO 532B. Acustica, 55, 63–67.

2 Musical Timbre Perception


Stephen McAdams
McGill University, Montreal, Quebec, Canada

Timbre is a misleadingly simple and exceedingly vague word encompassing a very complex set of auditory attributes, as well as a plethora of intricate psychological and musical issues. It covers many parameters of perception that are not accounted for by pitch, loudness, spatial position, duration, or even by various environmental characteristics such as room reverberation. This leaves myriad possibilities, some of which have been explored during the past 40 years or so. We now understand timbre to have two broad characteristics that contribute to the perception of music: (1) it is a multitudinous set of perceptual attributes, some of which are continuously varying (e.g., attack sharpness, brightness, nasality, richness), others of which are discrete or categorical (e.g., the "blatt" at the beginning of a sforzando trombone sound or the pinched offset of a harpsichord sound), and (2) it is one of the primary perceptual vehicles for the recognition, identification, and tracking over time of a sound source (a singer's voice, a clarinet, a set of carillon bells) and thus is involved in the absolute categorization of a sounding object (Hajda, Kendall, Carterette, & Harshberger, 1997; Handel, 1995; McAdams, 1993; Risset, 2004). Understanding the perception of timbre thus covers a wide range of issues, from determining the properties of vibrating objects and of the acoustic waves emanating from them, developing techniques for quantitatively analyzing and characterizing sound waves, formalizing models of how the acoustic signal is analyzed and coded neurally by the auditory system, and characterizing the perceptual representation of the sounds used by listeners to compare sounds in an abstract way or to categorize or identify their physical source, to understanding the role that timbre can play in perceiving musical patterns and forms and in shaping musical performance expressively. More theoretical approaches to timbre have also included considerations of the musical implications of timbre as a set of form-bearing dimensions in music (cf. McAdams, 1989). This chapter will focus on some of these issues in detail: the psychophysics of timbre, timbre as a vehicle for source identity, the role of timbre in musical grouping, and timbre as a structuring force in music perception, including the effect of sound blending on the perception of timbre, timbre's role in the grouping of events into streams and musical patterns, the perception of timbral intervals, the role of timbre in the building and release of musical tension, and implicit learning of timbral grammars. A concluding section will examine a number of issues that have not yet been studied extensively, concerning the role of timbre
characterization in music information retrieval systems, the control of timbral variation by instrumentalists and sound synthesis control devices to achieve musical expressiveness, the link between timbre perception and cognition on the one hand and orchestration and electroacoustic music composition on the other, and finally, consideration of timbre's status as a primary or secondary parameter in musical structure.¹

I. Psychophysics of Timbre

One of the main approaches to timbre perception attempts to characterize quantitatively the ways in which sounds are perceived to differ. Early research on the perceptual nature of timbre focused on preconceived aspects such as the relative weights of different frequencies present in a given sound, or its sound color (Slawson, 1985). For example, both a voice singing a constant middle C while varying the vowel being sung and a brass player holding a given note while varying the embouchure and mouth cavity shape would vary the shape of the sound spectrum (cf. McAdams, Depalle & Clarke, 2004). Helmholtz (1885/1954) invented some rather ingenious resonating devices for controlling spectral shape to explore these aspects of timbre. However, the real advances in understanding the perceptual representation of timbre had to wait for the development of signal generation and processing techniques and of multidimensional data analysis techniques in the 1950s and 1960s. Plomp (1970) and Wessel (1973) were the first to apply these to timbre perception.

A. Timbre Space
Multidimensional scaling (MDS) makes no prior assumptions about the physical or perceptual structure of timbre. Listeners simply rate all pairs from a given set of sounds on a scale varying from very similar to very dissimilar. The sounds are usually equalized in terms of pitch, loudness, and duration and are presented from the same location in space so that only the timbre varies, in order to focus listeners' attention on this set of attributes. The dissimilarity ratings are then fit to a distance model in which sounds with similar timbres are closer together and those with dissimilar timbres are farther apart. The analysis approach is presented in Figure 1. The graphic representation of the distance model is called a timbre space. Such techniques have been applied to synthetic sounds (Miller & Carterette, 1975; Plomp, 1970; Caclin, McAdams, Smith & Winsberg, 2005), resynthesized or simulated instrument sounds (Grey, 1977; Kendall, Carterette, & Hajda, 1999; Krumhansl, 1989; McAdams, Winsberg, Donnadieu, De Soete & Krimphoff, 1995; Wessel, 1979), recorded instrument sounds (Iverson & Krumhansl, 1993; Lakatos,
¹ In contrast to the chapter on timbre in the previous editions of this book, less emphasis will be placed on sound analysis and synthesis and more on perception and cognition. Risset and Wessel (1999) remains an excellent summary of these former issues.


Figure 1 Stages in the multidimensional analysis of dissimilarity ratings of sounds differing in timbre.

2000; Wessel, 1973), and even dyads of recorded instrument sounds (Kendall & Carterette, 1991; Tardieu & McAdams, in press). The basic MDS model, such as Kruskal's (1964a, 1964b) nonmetric model, is expressed in terms of continuous dimensions that are shared among the timbres, the underlying assumption being that all listeners use the same perceptual dimensions to compare the timbres. The model distances are fit to the empirically derived proximity data (usually dissimilarity ratings or confusion ratings among sounds). More complex models also include dimensions or features that are specific to individual timbres, called specificities (EXSCAL, Winsberg & Carroll, 1989), and different perceptual weights accorded to the dimensions and specificities by individual listeners or latent classes of listeners (INDSCAL, Carroll & Chang, 1970; CLASCAL, Winsberg & De Soete, 1993; McAdams et al., 1995). The equation defining distance in the more general CLASCAL model is the following:

d_{ijt} = \left[ \sum_{r=1}^{R} w_{tr} (x_{ir} - x_{jr})^2 + v_t (s_i + s_j) \right]^{1/2}    (Eq. 1)

where d_{ijt} is the distance between sounds i and j for latent class t, x_{ir} is the coordinate of sound i on dimension r, R is the total number of dimensions, w_{tr} is the weight on dimension r for class t, s_i is the specificity on sound i, and v_t is the weight on the whole set of specificities for class t. The basic model doesn't have
weights or specificities and has only one class of listeners. EXSCAL has specificities, but no weights. For INDSCAL, the number of latent classes is equal to the number of listeners. Finally, the CONSCAL model allows continuous mapping functions between audio descriptors and the position of sounds along a perceptual dimension to be modeled for each listener by using spline functions, with the proviso that the position along the perceptual dimension respect the ordering along the physical dimension (Winsberg & De Soete, 1997). This technique allows one to determine the auditory transform of each physical parameter for each listener. Examples of the use of these different analysis models include Kruskal's technique by Plomp (1970), INDSCAL by Wessel (1973) and Grey (1977), EXSCAL by Krumhansl (1989), CLASCAL by McAdams et al. (1995), and CONSCAL by Caclin et al. (2005). Descriptions of how to use the CLASCAL and CONSCAL models in the context of timbre research are provided in McAdams et al. (1995) and Caclin et al. (2005), respectively. Specificities are often found for complex acoustic and synthesized sounds. They are considered to represent the presence of a unique feature that distinguishes a sound from all others in a given context. For example, in a set of brass, woodwind, and string sounds, a harpsichord has a feature shared with no other sound: the return of the hopper, which creates a slight thump and quickly damps the sound at the end. Or in a set of sounds with fairly smooth spectral envelopes, such as brass instruments, the jagged spectral envelope of the clarinet due to the attenuation of the even harmonics at lower harmonic ranks would be a feature specific to that instrument. Such features might appear as specificities in the EXSCAL and CLASCAL distance models (Krumhansl, 1989; McAdams et al., 1995), and the strength of each feature is represented by the square root of the specificity value in Equation 1. Some models include individual and class differences as weighting factors on the different dimensions and the set of specificities. For example, some listeners might pay more attention to spectral properties than to temporal aspects, whereas others might have the inverse pattern. Such variability could reflect either differences in sensory processing or in listening and rating strategies. Interestingly, no study to date has demonstrated that such individual differences have anything to do with musical experience or training. For example, McAdams et al. (1995) found that similar proportions of nonmusicians, music students, and professional musicians fell into the different latent classes, suggesting that whereas listeners differ in terms of the perceptual weight accorded to the different dimensions, these interindividual differences are unrelated to musical training. It may be that because timbre perception is so closely allied with the ability to recognize sound sources in everyday life, everybody is an expert to some degree, although different people are sensitive to different features. An example timbre space, drawn from McAdams et al. (1995), is shown in Figure 2. It is derived from the dissimilarity ratings of 84 listeners, including nonmusicians, music students, and professional musicians. Listeners were presented with digital simulations of instrument sounds and chimeric sounds combining features of different instruments (such as the vibrone, with both vibraphone-like and

[Figure 2 appears here. Its axes are Dimension 1 (log attack time), Dimension 2 (spectral centroid), and Dimension 3 (spectral flux).]

Figure 2 The timbre space found by McAdams et al. (1995) for a set of synthesized sounds. The CLASCAL solution has three dimensions with specificities (the strength of the specificity is shown by the size of the square). The acoustic correlates for each dimension are also indicated. (vbs = vibraphone, hrp = harp, ols = obolesta (oboe/celesta hybrid), gtr = guitar, pno = piano, vbn = vibrone (vibraphone/trombone hybrid), hcd = harpsichord, obc = obochord (oboe/harpsichord hybrid), gtn = guitarnet (guitar/clarinet hybrid), cnt = clarinet, sno = striano (bowed string/piano hybrid), tbn = trombone, fhn = French horn, stg = bowed string, tpr = trumpar (trumpet/guitar hybrid), ehn = English horn, bsn = bassoon, tpt = trumpet). Modified from Figure 1, McAdams et al. (1995). Copyright 1995 by Springer-Verlag. Adapted with permission.

trombone-like features). Wessel, Bristow, and Settel (1987) created these sounds on a Yamaha DX7 FM synthesizer. A CLASCAL analysis revealed three shared dimensions, the existence of specificities on the sounds, and five latent classes of listeners, for whom the relative weights on the shared dimensions and set of specificities differed. The relative weights on the three dimensions and the set of specificities for the five latent classes are shown in Figure 3. Most listeners were in classes 1 and 2 and had fairly equal weights across dimensions and specificities. What distinguished these two classes was simply the use of the rating scale: Class 1 listeners used


Figure 3 Normalized weights on the three shared dimensions and the set of specificities for five latent classes of listeners in the McAdams et al. (1995) study.

more of the scale than did listeners from Class 2. For the other three classes, however, some dimensions were prominent (high weights) and others were perceptually attenuated (low weights). For example, Class 3 listeners gave high weight to Dimension 2, which seems to be related to spectral characteristics of the sounds, and low weight to the specificities. Inversely, Class 4 listeners favored Dimension 1 (related to the temporal dimension of attack time) and the specificities, and attenuated the spectral (Dim 2) and spectrotemporal (Dim 3) dimensions. Timbre space models have been useful in predicting listeners' perceptions in situations beyond those specifically measured in the experiments, which suggests that they do in fact capture important aspects of timbre representation. Consistent with the predictions of a timbre model, Grey and Gordon (1978) found that exchanging the spectral envelopes of pairs of sounds that differed primarily along a dimension of their space believed to be related to spectral properties caused those sounds to switch positions along that dimension. Timbre space has also been useful in predicting the perception of intervals between timbres, as well as stream segregation based on timbre-related acoustic cues (see below).
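Equation 1 can be made concrete with a small numerical sketch. The coordinates, specificities, and class weights below are invented for illustration and are not the values published by McAdams et al. (1995); only the form of the distance computation follows the equation:

    import numpy as np

    def clascal_distance(x_i, x_j, s_i, s_j, w_t, v_t):
        # Distance between sounds i and j for latent class t (Eq. 1):
        # a weighted Euclidean distance across the shared dimensions plus a
        # class-weighted contribution of the two sounds' specificities.
        x_i, x_j, w_t = map(np.asarray, (x_i, x_j, w_t))
        return np.sqrt(np.sum(w_t * (x_i - x_j) ** 2) + v_t * (s_i + s_j))

    # Invented 3-D coordinates and specificities for two sounds
    harpsichord, spec_hpd = [1.8, 0.6, -0.4], 1.2
    trombone, spec_tbn = [-2.0, -0.9, 0.3], 0.1

    # Invented class weights: "class A" emphasizes the spectral dimension,
    # "class B" emphasizes attack time and the specificities.
    class_a = dict(w_t=[0.9, 1.5, 0.8], v_t=0.5)
    class_b = dict(w_t=[1.5, 0.7, 0.8], v_t=1.3)

    for name, cls in (("class A", class_a), ("class B", class_b)):
        d = clascal_distance(harpsichord, trombone, spec_hpd, spec_tbn, **cls)
        print(f"{name}: d = {d:.2f}")

The same pair of sounds thus yields different distances for different latent classes, which is how the model captures the kinds of interindividual weighting differences described above.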

[Figure 4 appears here. The spectra shown have spectral centroids, in units of harmonic rank, of about 2.6 for the trombone and 4.3 for the oboe.]

Figure 4 Spectral centroid in relation to the second dimension of Krumhansl's (1989) space using the synthesized sounds from Wessel et al. (1987). The graphs at the left and right represent the frequency spectra of two of the sounds (trombone and oboe, respectively). The arrowhead on the x axis indicates the location of the spectral centroid. The graph in the middle shows the regression of spectral centroid (x axis) onto the position along the perceptual dimension (y axis). Note that all the points are very close to the regression line, indicating a close association between the physical and perceptual parameters.

B. Audio Descriptors of Timbral Dimensions


In many studies, independent acoustic correlates have been determined for the continuous dimensions by correlating the position along the perceptual dimension with a unidimensional acoustic parameter extracted from the sounds (e.g., Grey & Gordon, 1978; Kendall et al., 1999; Krimphoff, McAdams, & Winsberg, 1994; McAdams et al., 1995). We will call such parameters audio descriptors, although they are also referred to as audio features in the field of music information retrieval. The most ubiquitous correlates derived from musical instrument sounds include spectral centroid (representing the relative weights of high and low frequencies and corresponding to timbral brightness or nasality: an oboe has a higher spectral centroid than a French horn; see Figure 4), the logarithm of the attack time (distinguishing continuant instruments that are blown or bowed from impulsive instruments that are struck or plucked; see Figure 5), spectral flux (the degree of evolution of the spectral shape over a tone's duration, which is high for brass and lower for single reeds; see Figure 6), and spectral deviation (the degree of jaggedness of the spectral shape, which is high for clarinet and vibraphone and low for trumpet; see Figure 7). Caclin et al. (2005) conducted a confirmatory study employing dissimilarity ratings on purely synthetic sounds in which the exact nature of the stimulus dimensions could be controlled. These authors confirmed the

[Figure 5 appears here. The amplitude envelopes shown have attack times of approximately 4 ms (vibraphone) and 330 ms (bowed piano); the middle panel plots Dimension 1 against log attack time.]

Figure 5 Log attack time in relation to the first dimension of Krumhansl's (1989) space. The graphs on the left and right sides show the amplitude envelopes of the vibraphone and bowed piano sounds. The attack time is indicated by the arrows.
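The attack-time measure illustrated in Figure 5 can be approximated directly from an amplitude envelope. A common convention is to measure the time taken to rise between two fractions of the maximum amplitude; the 10% and 90% thresholds and the toy envelopes used below are assumptions for illustration rather than the exact definition used in the studies cited here:

    import numpy as np

    def log_attack_time(envelope, fs, lower=0.1, upper=0.9):
        # Log10 of the time (in seconds) taken for the amplitude envelope to
        # rise from `lower` to `upper` times its maximum value.
        env = np.asarray(envelope, dtype=float)
        peak = env.max()
        i_lower = np.argmax(env >= lower * peak)   # first sample above 10% of max
        i_upper = np.argmax(env >= upper * peak)   # first sample above 90% of max
        return np.log10(max(i_upper - i_lower, 1) / fs)

    fs = 1000                                      # envelope sample rate (Hz)
    t = np.arange(0, 1.0, 1.0 / fs)
    # Toy envelopes: a fast, percussive attack and a slow, bowed one
    percussive = np.minimum(t / 0.004, 1.0) * np.exp(-t / 0.3)
    bowed = 1.0 - np.exp(-t / 0.15)

    print(log_attack_time(percussive, fs))   # short attack: strongly negative
    print(log_attack_time(bowed, fs))        # long attack: closer to zero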

perception of stimulus dimensions related to spectral centroid, log attack time, and spectral deviation but did not confirm spectral flux. Of the studies attempting to develop audio descriptors that are correlated with the perceptual dimensions of their timbre spaces, most have focused on a small set of sounds and a small set of descriptors. Over the years, a large set of descriptors has been developed at IRCAM (Institut de Recherche et Coordination Acoustique/Musique), starting with the work of Jochen Krimphoff (Krimphoff et al., 1994). The aim was to represent a wide range of temporal, spectral, and spectrotemporal properties of the acoustic signals that could be used as metadata in content-based searches in very large sound databases. The culmination of this work has recently been published (Peeters, Giordano, Susini, Misdariis, & McAdams, 2011), and the Timbre Toolbox has been made available in the form of a Matlab toolbox² that contains a set of 54 descriptors based on the energy envelope, the short-term Fourier transform, harmonic sinusoidal components, or the gamma-tone filter-bank model of peripheral auditory processing (Patterson, Allerhand, & Giguère, 1995). These audio descriptors capture temporal, spectral, spectrotemporal, and energetic properties of acoustic events. Temporal descriptors include properties such as attack, decay, release, temporal centroid, effective duration, and the frequency and amplitude of modulation in the energy envelope. Spectral shape descriptors include
2 http://recherche.ircam.fr/pub/timbretoolbox or http://www.cirmmt.mcgill.ca/research/tools/timbretoolbox


Figure 6 Spectral flux in relation to the third dimension of the space found by McAdams et al. (1995). The left and right graphs show the variation over time of the spectral centroid for the trombone and the sampled piano. Note that the points are more spread out around the regression line in the middle graph, indicating that this physical parameter explains much less of the variance in the positions of the sounds along the perceptual dimension.

Figure 7 Spectral deviation in relation to the third dimension of the space found by Krumhansl (1989). The left and right graphs show the frequency spectra and global spectral envelopes of the trumpet (spectral deviation = 5.7 dB) and clarinet (spectral deviation = 41.4 dB) sounds. Note that the amplitudes of the frequency components are close to the global envelope for the trumpet, but deviate above and below this envelope for the clarinet.

Spectral shape descriptors include measures of the centroid, spread, skewness, kurtosis, slope, rolloff, crest factor, and jaggedness of the spectral envelope. Spectrotemporal descriptors include spectral flux. Energetic descriptors include harmonic energy, noise energy, and statistical properties of the energy envelope. In addition, descriptors related to periodicity/harmonicity and noisiness were included. Certain of these descriptors have a single value for a sound event, such as attack time, whereas others represent time-varying quantities, such as the variation of spectral centroid over the duration of a sound event. Statistical properties of these time-varying quantities can then be used, such as measures of central tendency or variability (robust statistics of median and interquartile range were used by Peeters et al., 2011). One problem with a large number of descriptors is that they may be correlated among themselves for a given set of sounds, particularly if they are applied to a limited sound set. Peeters et al. (2011) examined the information redundancy across the audio descriptors by performing correlational analyses between descriptors calculated on a very large set of highly heterogeneous musical sounds (more than 6,000 sounds from the McGill University Master Samples, MUMS; Opolko & Wapnick, 2006). They then subjected the resulting correlation matrix to hierarchical clustering. The analysis also sought to assess whether the Timbre Toolbox could account for the dimensional richness of real musical sounds and to provide users of the Toolbox with a set of guidelines for selecting among the numerous descriptors implemented therein. The analyses yielded roughly 10 classes of descriptors that are relatively independent.


Two clusters represented spectral shape properties, one based primarily on median values (11 descriptors) and the other solely on the interquartile ranges of the time-varying measures of these spectral properties (7 descriptors). Thus central tendencies and variability of spectral shape behave independently across the MUMS database. A large third cluster of 16 descriptors included most of the temporal descriptors, such as log attack time, and energetic descriptors, such as variability in noise energy and total energy over time. A fourth large cluster included 10 descriptors related to periodicity, noisiness, and jaggedness of the spectral envelope. The remaining smaller clusters had one or two descriptors each and included descriptors of spectral shape, spectral variation, and amplitude and frequency of modulations in the temporal envelope. The combination of a quantitative model of perceptual relations among timbres and the psychophysical explanation of the parameters of the model is an important step in gaining predictive control of timbre in several domains, such as sound analysis and synthesis and intelligent content-based search in sound databases (McAdams & Misdariis, 1999; Peeters, McAdams, & Herrera, 2000). Such representations are only useful to the extent that they are (a) generalizable beyond the set of sounds actually studied, (b) robust with respect to changes in musical context, and (c) generalizable to kinds of listening tasks other than those used to construct the model. To the degree that a representation has these properties, it may be considered an accurate account of musical timbre, characterized by an important feature of a scientific model: the ability to predict new empirical phenomena.
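The redundancy analysis just described can be sketched compactly. The fragment below is a schematic Python illustration of the general approach, not the analysis of Peeters et al. (2011): descriptors is an assumed (sounds x descriptors) matrix of descriptor values, and the conversion of correlations into distances, the average-linkage method, and the cut at 10 clusters are illustrative choices.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# descriptors: assumed array of shape (n_sounds, n_descriptors)
corr = np.corrcoef(descriptors, rowvar=False)         # descriptor-by-descriptor correlations
dist = 1.0 - np.abs(corr)                             # strongly (anti)correlated pairs -> small distance
np.fill_diagonal(dist, 0.0)
tree = linkage(squareform(dist, checks=False), method="average")
labels = fcluster(tree, t=10, criterion="maxclust")   # force roughly 10 descriptor classes
print(labels)                                         # one cluster label per descriptor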

C. Interaction of Timbre with Pitch and Dynamics


Most timbre space studies have restricted the pitch and loudness to single values for all of the instrument sounds compared in order to focus listeners' attention on timbre alone. An important question arises, however, concerning whether the timbral relations revealed for a single pitch and/or a single dynamic level hold at different pitches and dynamic levels and, more importantly for extending this work to real musical contexts, whether they hold for timbres being compared across pitches and dynamic levels. It is clear that for many instruments the timbre varies as a function of pitch because the spectral, temporal, and spectrotemporal properties of the sounds covary with pitch. Marozeau, de Cheveigné, McAdams, and Winsberg (2003) have shown that timbre spaces for recorded musical instrument tones are similar at different pitches (B3, C#4, Bb4). Listeners are also able to ignore pitch differences within an octave when asked to compare only the timbres of the tones. When the pitch variation is greater than an octave, interactions between the two attributes occur. Marozeau and de Cheveigné (2007) varied the brightness of a set of synthesized sounds, while also varying the pitch over a range of 18 semitones. They found that differences in pitch affected timbre relations in two ways: (1) pitch shows up in the timbre space representation as a dimension orthogonal to the timbre dimensions (indicating simply that listeners were no longer ignoring the pitch difference), and (2) pitch differences systematically affect the timbre dimension related to spectral centroid.


Handel and Erickson (2004) also found that listeners had difficulty extrapolating the timbre of a sound source across large differences in pitch. Conversely, Vurma, Raju, and Kuuda (2011) have reported that timbre differences between two tones whose tuning was to be judged affected the pitch judgments to an extent that could potentially lead to conflicts between subjective and fundamental-frequency-based assessments of tuning. Krumhansl and Iverson (1992) found that speeded classifications of pitches and of timbres were symmetrically affected by uncorrelated variation along the other parameter. These results suggest a close relation between timbral brightness and pitch height, and perhaps even between more temporally fine-grained features related to the coding of periodicity in the auditory system or larger-scale timbral properties related to the energy envelope. This link would be consistent with underlying neural representations that share common attributes, such as tonotopic and periodicity organizations in the brain. Similarly to pitch, changes in dynamics also produce changes in timbre for a given instrument, particularly, but not exclusively, with respect to spectral properties. Sounds produced with greater playing effort (e.g., fortissimo vs. pianissimo) not only have greater energy at the frequencies present in the softer sound, but the spectrum spreads toward higher frequencies, creating a higher spectral centroid, a greater spectral spread, and a lower spectral slope. No studies to date of which we are aware have examined the effect of change in dynamic level on timbre perception, but some work has looked at the role of timbre in the perception of dynamic level independently of the physical level of the signal. Fabiani and Friberg (2011) studied the effect of variations in pitch, sound level, and instrumental timbre (clarinet, flute, piano, trumpet, and violin) on the perception of the dynamics of isolated instrumental tones produced at different pitches and dynamics. They subsequently presented these sounds to listeners at different physical levels. Listeners were asked to indicate the perceived dynamics of each stimulus on a scale from pianissimo to fortissimo. The results showed that the timbral effects produced at different dynamics, as well as the physical level, had equally large effects for all five instruments, whereas pitch was relevant mostly for clarinet, flute, and piano. Thus estimates of the dynamics of musical tones are based on both loudness and timbre, and to a lesser degree on pitch as well.

II. Timbre as a Vehicle for Source Identity

The second approach to timbre concerns its role in the recognition of the identity of a musical instrument or, in general, of a sound-generating event, that is, the interaction between objects, or between a moving medium (air) and an object, that sets up vibrations in the object or in a cavity enclosed by the object. One reasonable hypothesis is that the sensory dimensions that compose timbre serve as indicators used in the categorization, recognition, and identification of sound events and sound sources (Handel, 1995; McAdams, 1993). Research on musical instrument identification is relevant to this issue.


Saldanha and Corso (1964) studied identification of isolated musical instrument sounds from the Western orchestra played with and without vibrato. They were interested in the relative importance of onset and offset transients, the spectral envelope of the sustain portion of the sound, and vibrato. Identification of isolated sounds is surprisingly poor for some instruments. When attacks and decays were excised, identification decreased markedly for some instruments, particularly when the attack portion was removed from sounds without vibrato. However, when vibrato was present, the effect of cutting the attack was smaller and identification was better. These results suggest that important information for instrument identification is present in the attack portion, but that in the absence of the normal attack, additional information is still available in the sustain portion, particularly when vibrato is present (although it is more important for some instruments than others). The vibrato may increase our ability to extract information relative to the resonance structure of the instrument (McAdams & Rodet, 1988). Giordano and McAdams (2010) performed a meta-analysis on previously published data concerning identification rates and dissimilarity ratings of musical instrument tones. The goal of this study was to ascertain the extent to which tones generated with large differences in the mechanisms for sound production were recovered in the perceptual data. Across all identification studies, listeners frequently confused tones generated by musical instruments with a similar physical structure (e.g., clarinets and saxophones, both single-reed instruments) and seldom confused tones generated by very different physical systems (e.g., the trumpet, a lip-valve instrument, and the bassoon, a double-reed instrument). Consistently, the vast majority of previously published timbre spaces revealed that tones generated with similar resonating structures (e.g., string instruments vs. wind instruments) or with similar excitation mechanisms (e.g., impulsive excitation as in piano tones vs. sustained excitation as in flute tones) occupied the same region in the space. These results suggest that listeners can reliably identify large differences in the mechanisms of tone production, focusing on the timbre attributes used to evaluate the dissimilarities among musical sounds. Several investigations on the perception of everyday sounds extend the concept of timbre beyond the musical context (see McAdams, 1993; Handel, 1995; Lutfi, 2008, for reviews). Among them, studies on impact sounds provide information on the timbre attributes useful to the perception of the properties of percussion instruments: bar geometry (Lakatos, McAdams, & Caussé, 1997), bar material (McAdams, Chaigne, & Roussarie, 2004), plate material (Giordano & McAdams, 2006; McAdams, Roussarie, Chaigne, & Giordano, 2010), and mallet hardness (Freed, 1990; Giordano, Rocchesso, & McAdams, 2010). The timbral factors relevant to perceptual judgments vary with the task at hand. Spectral factors are primary for the perception of geometry (Lakatos et al., 1997). Spectrotemporal factors (e.g., the rate of change of spectral centroid and loudness) dominate the perception of the material of struck objects (McAdams et al., 2004; Giordano & McAdams, 2006) and of mallets (Freed, 1990). But spectral and temporal factors can also play a role in the perception of different kinds of gestures used to set an instrument into vibration, such as the angle and position of a plucking finger on a guitar string (Traube, Depalle, & Wanderley, 2003).


The perception of an instrument's identity in spite of variations in pitch may be related to timbral invariance, those aspects of timbre that remain constant with change in pitch and loudness. Handel and Erickson (2001) found that musically untrained listeners are able to recognize two sounds produced at different pitches as coming from the same instrument or voice only within a pitch range of about an octave. Steele and Williams (2006) found that musically trained listeners could perform this task at about 80% correct even with pitch differences on the order of 2.5 octaves. Taken together, these results suggest that there are limits to timbral invariance across pitch, but that they depend on musical training.

Timbre's role in source identification and categorization is perhaps its more neglected aspect, and it brings with it advantages and disadvantages for the use of timbre as a form-bearing dimension in music (McAdams, 1989). One of the advantages is that categorization and identification of a sound source may bring into play perceptual knowledge (acquired by listeners implicitly through experience in the everyday world and in musical situations) that helps them track a given voice or instrument in a complex musical texture. Listeners do this easily, and some research has shown that timbral factors may make an important contribution to such voice tracking (Culling & Darwin, 1993; Gregory, 1994), which is particularly important in polyphonic settings. The disadvantages may arise in situations in which the composer seeks to create melodies across instrumental timbres, e.g., the Klangfarbenmelodien of Schoenberg (1911/1978). Our predisposition to identify the sound source and follow it through time would impede a more relative perception in which the timbral differences were perceived as a movement through timbre space rather than as a simple change of sound source. For cases in which such timbral compositions work, the composers have often taken special precautions to create a musical situation that draws the listener more into a relative than into an absolute mode of perceiving.

III. Timbre as a Structuring Force in Music Perception

Timbre perception is at the heart of orchestration, a realm of musical practice that has received relatively little experimental study, or even music-theoretic treatment for that matter. Instrumental combinations can give rise to new timbres if the sounds are perceived as blended. Timbral differences can also create auditory streaming of similar timbres and segregation of dissimilar timbres, as well as induce segmentation of sequences when timbral discontinuities occur. Listeners can perceive intervals between timbres as similar when they are transposed to a different part of timbre space, even though such relations have not been used explicitly in music composition. Timbre can play a role in creating and releasing musical tension. And finally, there is some evidence that listeners can learn statistical regularities in timbre sequences, opening up the possibility of developing timbre-based grammars in music.


A. Timbral Blend
The creation of new timbres through orchestration necessarily depends on the degree to which the constituent sound sources fuse together or blend to create the newly emergent sound (Brant, 1971; Erickson, 1975). Sandell (1995) has proposed that there are three classes of perceptual goals in combining instruments: timbral heterogeneity, in which one seeks to keep the instruments perceptually distinct; timbral augmentation, in which one instrument embellishes another one that perceptually dominates the combination; and timbral emergence, in which a new sound results that is identified as none of its constituents. Blend appears to depend on a number of acoustic factors, such as onset synchrony of the constituent sounds, and on others that are more directly related to timbre, such as the similarity of the attacks, the difference in the spectral centroids, and the overall centroid of the combination. For instance, Sandell (1989) found that by submitting blend ratings, taken as a measure of proximity, to multidimensional scaling, a blend space could be obtained; the dimensions of this space were correlated with attack time and spectral centroid, suggesting that the more these parameters were similar for the two combined sounds, the greater their blend (Figure 8). A similar trend concerning the role of spectrotemporal similarity in blend was found for wind instrument combinations by Kendall and Carterette (1993).

Figure 8 Multidimensional analysis of blend ratings for all pairs of sounds drawn from the timbre space of Grey (1977). If two instruments are close in the space (e.g., BN and S1), the degree of blend is rated as being strong. If they are far apart (e.g., TP and X2), the blending is weak and the sounds tend to be heard separately. The dimensions of this blend space are moderately correlated with the attack time (x axis) and strongly correlated with spectral centroid (y axis). (TM = muted trombone, C1-C2 = clarinets, O1-O2 = oboes, TP = trumpet, BN = bassoon, FH = French horn, FL = flute, S1-S3 = strings, X1-X3 = saxophones, EH = English horn). © 1989 by Gregory Sandell. Adapted with permission.


These authors also revealed an inverse relation between blend and identifiability of the constituent sounds; that is, sounds that blend better are more difficult to identify separately in the mixture. For dyads of impulsive and continuant sounds, the blend is greater for slower attacks and lower spectral centroids, and the resulting emergent timbre is determined primarily by the properties of the impulsive sound (Tardieu & McAdams, in press).
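Sandell's strategy of treating blend ratings as proximities and submitting them to multidimensional scaling can be approximated with standard tools. The sketch below, using scikit-learn, is a schematic reconstruction rather than Sandell's (1989) exact procedure: blend is an assumed symmetric matrix of mean blend ratings for all instrument pairs, and converting ratings to dissimilarities by subtracting them from the maximum rating, as well as the choice of two dimensions, are illustrative assumptions.

import numpy as np
from sklearn.manifold import MDS

# blend: assumed symmetric (n_instruments x n_instruments) matrix of mean blend ratings
dissim = blend.max() - blend                      # high blend -> small dissimilarity
np.fill_diagonal(dissim, 0.0)
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(dissim)                # one 2-D point per instrument

# coords[:, 0] and coords[:, 1] could then be correlated with attack time and
# spectral centroid, as in the blend space of Figure 8.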

B. Timbre and Musical Grouping


An important way in which timbre can contribute to the organization of musical structure is related to the fact that listeners tend to perceptually connect sound events that arise from the same sound source. In general, a given source will produce sounds that are relatively similar in pitch, loudness, timbre, and spatial position from one event to the next (see Bregman, 1990, Chapter 2; McAdams & Bregman, 1979, for reviews). The perceptual connection of successive sound events into a coherent message through time is referred to as auditory stream integration, and the separation of events into distinct messages is called auditory stream segregation (Bregman & Campbell, 1971). One guiding principle that seems to operate in the formation of auditory streams is the following: successive events that are relatively similar in their spectrotemporal properties (i.e., in their pitches and timbres) may have arisen from the same source and should be grouped together; individual sources do not tend to change their acoustic properties suddenly and repeatedly from one event to the next. Early demonstrations (see Figure 9) of auditory streaming on the basis of timbre suggest a link between the timbre-space representation and the tendency for streaming on the basis of the spectral differences that are created (McAdams & Bregman, 1979; Wessel, 1979). Hartmann and Johnson's (1991) experimental results convinced them that it was primarily the spectral aspects of timbre (such as spectral centroid) that were responsible for auditory streaming and that temporal aspects (such as attack time) had little effect.


Figure 9 The two versions of a melody created by David Wessel with one instrument (top) or two alternating instruments (bottom). In the upper single-timbre melody, a single rising triplet pattern is perceived. In the lower alternating-timbre melody, if the timbral difference is sufficient, two interleaved patterns of descending triplets at half the tempo of the original sequence are heard.


More recently, the picture has changed significantly, and several studies indicate an important role for both spectral and temporal attributes of timbre in auditory stream segregation (Moore & Gockel, 2002). Iverson (1995) used sequences alternating between two recorded instrument tones with the same pitch and loudness and asked listeners to judge the degree of segregation. Multidimensional scaling of the segregation judgments, treated as a measure of dissimilarity, was performed to determine which acoustic attributes contributed to the impression of auditory stream segregation. A comparison with previous timbre-space work using the same sounds (Iverson & Krumhansl, 1993) showed that both static acoustic cues (such as spectral centroid) and dynamic acoustic cues (such as attack time and spectral flux) were implicated in segregation. This result was refined in an experiment by Singh and Bregman (1997) in which amplitude envelope and spectral content were independently varied and their relative contributions to stream segregation were measured. For the parameters used, a change from two to four harmonics produced a greater effect on segregation than did a change from a 5-ms attack and a 95-ms decay to a 95-ms attack and a 5-ms decay. Combining the two gave no greater segregation than was obtained with the spectral change, suggesting a stronger contribution of this sound property to segregation. Bey and McAdams (2003) used a melody discrimination paradigm in which a target melody interleaved with a distractor melody was presented first, followed by a test melody that was either identical to the target or differed by two notes that changed the contour (Figure 10). The timbre difference between target and distractor melodies was varied within the timbre space of McAdams et al. (1995).


Figure 10 Sequences used for testing the role of timbre in stream segregation. The task was to determine whether the isolated test melody had been present in the mixture of the target melody (empty circles) and an interleaved distractor melody (filled circles, with the darkness indicating degree of timbre difference between distractor and target). The test and target melodies always had the same timbre. Redrawn from Figure 2, Bey and McAdams (2003). © 2003 by The American Psychological Association, Inc. Adapted with permission.


Figure 11 A monotonic relation between the timbral distance and the rate of discrimination between target and test melodies shows that distance in timbre space predicts stream segregation. Redrawn from Figure 4, Bey and McAdams (2003). © 2003 by The American Psychological Association, Inc. Adapted with permission.

In line with the previously cited results, melody discrimination increased monotonically with the distance between the target and distractor timbres, which varied along the dimensions of attack time, spectral centroid, and spectral flux (Figure 11). All of these results are important for auditory stream segregation theory, because they show that several of a source's acoustic properties are taken into account when forming auditory streams. They are also important for music making (whether it be with electroacoustic or acoustic instruments), because they show that many aspects of timbre strongly affect the basic organization of the musical surface into streams. Different orchestrations of a given pitch sequence can completely change what is heard as melody and rhythm, as has been demonstrated by Wessel (1979). Timbre is also an important component in the perception of musical groupings, whether at the level of sequences of notes set off by sudden changes in timbre (Deliège, 1987) or of larger-scale musical sections delimited by marked changes in orchestration and timbral texture (Deliège, 1989).

C. Timbral Intervals
Consider the timbral trajectory shown in Figure 12 through the McAdams et al. (1995) timbre space starting with the guitarnet (gtn) and ending with the English horn (ehn). How would one construct a melody starting from the bowed string (stg) so that it would be perceived as a transposition of this Klangfarbenmelodie? The notion of transposing the relation between two timbres to another point in the timbre space poses the question of whether listeners can indeed perceive timbral intervals.

Figure 12 A trajectory of a short timbre melody through timbre space. How would one transpose the timbre melody starting on gtn to one starting on stg?

If timbral interval perception can be demonstrated, it opens the door to applying some of the operations commonly used on pitch sequences to timbre sequences (Slawson, 1985). A further interest of this exploration is that it extends the use of the timbre space as a perceptual model beyond the dissimilarity paradigm. Ehresman and Wessel (1978) took a first step in this direction. Based on previous work on semantic spaces and analogical reasoning (Henley, 1969; Rumelhart & Abrahamson, 1973), they developed a task in which listeners were asked to make judgments on the similarity of intervals formed between pairs of timbres. The basic idea was that timbral intervals may have properties similar to pitch intervals; that is, a pitch interval is a relation along a well-ordered dimension that retains a degree of invariance under certain kinds of transformation, such as translation along the dimension, or what musicians call transposition. But what does transposition mean in a multidimensional space? A timbral interval can be considered as a vector in space connecting two timbres. It has a specific length (the distance between the timbres) and a specific orientation. Together these two properties define the amount of change along each dimension of the space that is needed to move from one timbre to another. If we assume these dimensions to be continuous and linear from a perceptual point of view, then pairs of timbres characterized by the same vector relation should have the same perceptual relation and thus embody the same timbral interval. Transposition thus consists of translating the vector anywhere else in the space as long as its length and orientation are preserved.
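Under these assumptions, the vector model can be stated in a few lines of code. In the sketch below, timbre positions are points in the space of common dimensions; the coordinates are invented for illustration, and the use of a Euclidean metric is itself an assumption (a perceptually weighted metric could be substituted).

import numpy as np

def rank_candidates(A, B, C, candidates):
    # The interval A->B, transposed to start on C, ideally ends at C + (B - A);
    # candidate end points D are ranked by their distance to that ideal point.
    ideal = C + (B - A)
    dists = {name: np.linalg.norm(D - ideal) for name, D in candidates.items()}
    return sorted(dists, key=dists.get)

# Invented 3-D coordinates (log attack time, spectral centroid, spectral flux):
A = np.array([ 1.0, -0.5,  0.2])
B = np.array([-0.5,  1.0,  0.0])
C = np.array([ 2.0,  0.0, -1.0])
Ds = {"D1": np.array([0.0, 1.0, -1.5]),
      "D2": np.array([0.5, 1.5, -1.2]),   # lies exactly at C + (B - A)
      "D3": np.array([1.5, 2.0,  0.0])}
print(rank_candidates(A, B, C, Ds))       # -> ['D2', 'D1', 'D3'], best analogy first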


Figure 13 Examples of timbral intervals in a timbre space. The aim is to find an interval starting with C and ending on a timbre D that resembles the interval between timbres A and B. If we present timbres D1 to D4 (in a manner similar to that of Ehresman & Wessel, 1978), the vector model would predict that listeners would prefer D2, because the vector C-D2 is the closest in length and orientation to that of A-B.

Ehresman and Wessel (1978) tested this hypothesis using a task in which listeners had to compare two timbral intervals (e.g., A-B vs. C-D) and rank various timbre Ds according to how well they fulfilled the analogy: timbre A is to timbre B as timbre C is to timbre D (see Figure 13). They essentially found that the closer timbre D was to the ideal point defined by the vector model in timbre space, the higher the ranking; that is, the ideal C-D vector was a simple translation of the A-B vector, and A, B, C, and D form a parallelogram (shown with dashed lines in Figure 13). McAdams and Cunibile (1992) subsequently tested the vector model using the 3D space from Krumhansl (1989) (ignoring the specificities).


Five sets of timbres at different places in timbre space were chosen for each comparison to test for the generality of the results. Both electroacoustic composers and nonmusicians were tested to see if musical training and experience had any effect. All listeners found the task rather difficult, which is not surprising given that even professional composers have had almost no experience with music that uses timbral intervals in a systematic way. The main result is encouraging in that the data globally support the vector model, although this support was much stronger for electroacoustic composers than for nonmusicians. However, when one examines in detail the five different versions of each comparison type, it is clear that not all timbre comparisons go in the direction of the model predictions. One confounding factor is that the specificities on some timbres in this set were ignored. These specificities would necessarily distort the vectors that were used to choose the timbres, because they are like an additional dimension for each timbre. As such, certain timbral intervals correspond well to what is predicted because specificities are absent or low in value, whereas others are seriously distorted and thus not perceived as similar to other intervals due to moderate or high specificity values. What this line of reasoning suggests is that the use of timbral intervals as an integral part of a musical discourse risks being very difficult to achieve with very complex and idiosyncratic sound sources, because they will in all probability have specificities of some kind or another. The use of timbral intervals may, in the long run, be limited to synthesized sounds or blended sounds created through the combination of several instruments.

D. Building and Releasing Musical Tension with Timbre


Timbre can also contribute to larger-scale musical form and in particular to the sense of movement between tension and relaxation. This movement has been considered by many music theorists as one of the primary bases for the perception of larger-scale form in music. It has traditionally been tied to harmony in Western music and plays an important role in Lerdahl and Jackendoff's (1983) generative theory of tonal music. Experimental work on the role of harmony in the perception of musical tension and relaxation (or, put another way, the sense of tension that accompanies a moment at which the music must continue and the sense of relaxation that accompanies the completion of a musical phrase) has suggested that auditory roughness is an important component of perceived tension (Bigand, Parncutt, & Lerdahl, 1996). Roughness is an elementary timbral attribute based on the sensation of rapid fluctuations in the amplitude envelope. It can be generated by proximal frequency components that beat with one another. Dissonant intervals tend to have more such beating than consonant intervals. As such, a fairly direct relation between sensory dissonance and roughness has been demonstrated (cf. Parncutt, 1989; Plomp, 1976, for reviews).
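The link between proximal, beating components and envelope fluctuation can be illustrated numerically. The following sketch (NumPy and SciPy; the frequencies and duration are arbitrary choices) adds two equal-amplitude sinusoids 20 Hz apart and recovers the 20-Hz fluctuation of the resulting amplitude envelope, a modulation rate that falls in the range typically heard as roughness.

import numpy as np
from scipy.signal import hilbert

fs = 44100
t = np.arange(0, 0.5, 1.0 / fs)
x = np.sin(2 * np.pi * 440 * t) + np.sin(2 * np.pi * 460 * t)   # two proximal components

env = np.abs(hilbert(x))                        # amplitude envelope of the sum
spec = np.abs(np.fft.rfft(env - env.mean()))    # spectrum of the envelope fluctuation
beat_rate = np.fft.rfftfreq(len(env), 1.0 / fs)[np.argmax(spec)]
print(round(beat_rate))                         # ~20 Hz, the difference frequency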


Figure 14 Rated degree of completion at different stopping points (segments) for works by Bach and Webern, averaged over musician and nonmusician groups. The filled circles correspond to the piano version and the open circles to the orchestral version. The vertical bars represent the standard deviation. The asterisks over certain segments indicate a statistical difference between the two versions for that stopping point. Redrawn from Figure 1 in Paraskeva and McAdams (1997). © 1997 by the authors. Adapted with permission.

As a first step toward understanding how this operates in music, Paraskeva and McAdams (1997) measured the inflection of musical tension and relaxation due to timbral change. Listeners were asked to make judgments on a seven-point scale concerning the perceived degree of completion of the music at several points at which the music stopped. What results is a completion profile (Figure 14), which can be used to infer musical tension by equating completion with release and lack of completion with tension. Two pieces were tested: a fragment of the Ricercar from the Musical Offering for six voices by Bach (tonal) and the first movement of the Six Pieces for Orchestra, Op. 6 by Webern (nontonal). Each piece was played in an orchestral version (Webern's orchestration of the Musical Offering was used for the Bach) and in a direct transcription of this orchestral version for piano on a digital sampler. Although there were only small differences between the profiles for musicians and nonmusicians, there were significant differences between the piano and orchestral versions, indicating a significant effect of timbre change on perceived musical tension. However, when they were significantly different, the orchestral version was always more relaxed than the piano version. The hypothesis advanced by Paraskeva and McAdams (1997) for this effect was that the higher relaxation of the orchestral version might have been due to processes involved in auditory stream formation and the dependence of perceived roughness on the results of such processes (Wright & Bregman, 1987). Roughness, or any other auditory attribute of a single sound event, is computed after auditory organization processes have grouped the bits of acoustic information together. Piano sounds have a rather sharp attack. If several notes occur at the same time in the score and are played with a piano sound, they will be quite synchronous. Because they all start at the same time and have similar amplitude envelopes and similar timbres, they will tend to be fused together. The computed roughness will then result from the interactions of all the frequency components of all the notes. The situation may be quite different for the orchestral version for two reasons. The first is that the same timing is used for the piano and orchestra versions. In the latter, many instruments are used that have slow attacks, whereas others have faster attacks. There could then be greater asynchrony between the instruments in terms of perceived attack time (Gordon, 1987).


In addition, because the timbres of these instruments are often quite different, several different voices with different timbres arrive momentarily at a given vertical sonority, but the verticality is not perceived because the listener would more likely continue to track individual instruments horizontally in separate auditory streams. The attack asynchrony and the decomposition of verticalities into horizontalities would thus combine to reduce the degree of perceptual fusion. Reduced fusion would mean greater segregation, and thus the roughness in the orchestral version would be computed on each individually grouped auditory event rather than on the whole sound mass. These individual roughnesses would most likely be much smaller than those of the piano version. So once again, timbral composition can interact very tightly with auditory scene analysis processes.

E. Implicit Learning of Timbre-Based Grammars


In order to use timbre syntactically in music, listeners would need to be able to learn rules for ordering timbres in sequences, as they do for duration and pitch. This possibility was first explored by Bigand, Perruchet, and Boyer (1998), who presented artificial grammars of musical sounds for which sequencing rules were created. After being exposed to sequences constructed with the grammar, listeners heard new sequences and had to decide whether each one conformed or not to the learned grammar, without having to say why. Indeed, with the implicit learning of the structures of language and music, we can know whether a sequence corresponds to our language without knowing why: it just doesn't sound right. The correct response rate was above chance for these sequences, demonstrating the listeners' ability to learn a timbral grammar. Tillmann and McAdams (2004) extended this work by studying the influence of acoustic properties on implicit learning of statistical regularities (transition probabilities between temporally adjacent events) in sequences of musical sounds differing only in timbre. These regularities formed triplets of timbres drawn from the timbre space of McAdams et al. (1995). The transition probability between the first and second and between the second and third timbres was much higher than that between the third timbre of a given triplet and the first timbre of any other triplet in the language used in their experiment. In the implicit learning phase, listeners heard a rhythmically regular sequence of timbres, all at the same pitch and loudness, for 33 minutes. The sequence was composed of all of the triplets in the language in varied order. The goal was to determine whether listeners could learn the regularities that defined the triplets by simply listening to the sequences for a fairly short time. In addition to the principle of higher transition probability between timbres within the triplets than between those in different triplets, the sequences were also constructed so that the auditory grouping on the basis of timbral similarity was either congruent with the triplet structure or not (Figure 15). To achieve this, three grammars were created. For the congruent sequence (S1), the timbres within each triplet were fairly close within the McAdams et al. (1995) timbre space, and the distance between the last timbre of one triplet and the first timbre of the succeeding triplet was large.
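The statistical structure of such a sequence is easy to simulate. The toy example below (plain Python; the three-timbre groupings and the labels, borrowed from the McAdams et al. timbre set, are invented for illustration and are not the triplets actually used) concatenates randomly ordered triplets and then estimates the first-order transition probabilities: within a triplet the probability is 1.0 in this simplified version, whereas across a triplet boundary it is spread over the possible triplet-initial timbres.

import random
from collections import defaultdict

triplets = [("vbs", "hrp", "ols"), ("gtr", "vbn", "pno"), ("hcd", "obc", "gtn")]  # hypothetical grammar

def make_sequence(n=2000, seed=1):
    # Concatenate n randomly chosen triplets into one long timbre sequence.
    rng = random.Random(seed)
    seq = []
    for _ in range(n):
        seq.extend(rng.choice(triplets))
    return seq

def transition_probabilities(seq):
    # Estimate P(next timbre | current timbre) from adjacent pairs.
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(seq, seq[1:]):
        counts[a][b] += 1
    return {a: {b: n / sum(nxt.values()) for b, n in nxt.items()} for a, nxt in counts.items()}

p = transition_probabilities(make_sequence())
print(p["vbs"]["hrp"])   # within-triplet transition: 1.0
print(p["ols"]["gtr"])   # between-triplet transition: about 1/3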


Figure 15 Examples of timbre triplets used in the three timbral grammars drawn from the McAdams et al. (1995) timbre space. In S1 (congruent), the segmentation of the sequence into groups of timbres that are close in the space corresponded to the triplets of the grammar defined in terms of transition probabilities. In S2 (incongruent), the segmentation groups the last timbre of a triplet with the first of the next triplet, isolating the middle timbre of each triplet. In S3 (neutral), all timbres are more or less equidistant, thereby not creating segmentation.

If the timbral discontinuities created by the jumps in timbre space between triplets created a segmentation of the sequence, this segmentation would correspond to the triplets themselves. For the incongruent sequence (S2), there was a large distance between successive timbres within the triplets and a small distance from one triplet to the next. Accordingly, sequential grouping processes would create segmentations into two timbres traversing adjacent triplets and an isolated timbre in the middle of each triplet. Finally, a third sequence (S3) was composed so that all of the distances within and between triplets were uniformly medium within the McAdams et al. (1995) space, thus avoiding segmentation. After listening to one of the three sequences for 33 minutes, two groups of three timbres were presented, and the listener had to decide which one formed a triplet that was present in the sequence just heard.


Figure 16 Percent correct choice of triplets of the constructed grammar for sequences in which the perceptual segmentation was congruent, incongruent, or neutral with respect to the triplets of the grammar. The control group did not hear the learning sequence before the test session. The learning group was exposed to the grammar for 33 minutes before the test session. Redrawn from Figure 1, Tillmann and McAdams (2004). © 2004 by The American Psychological Association, Inc. Adapted with permission.

Another group of listeners did not hear the 33-minute sequence beforehand and had to decide which of the two groups of three timbres best formed a unit that could be part of a longer sequence of timbres. Choices of a triplet that were part of the grammar were scored as correct. Listeners were able to learn the grammar implicitly by simply listening to it, because the correct response rates of the learning group were higher than those of the group who were not exposed to the sequences beforehand (Figure 16). But curiously, this learning did not depend on the congruence between the grouping structure created by the acoustic discontinuities and the structure created by the statistical regularities determined by the transition probabilities between timbres within and between triplets. The same increase in correct response rate was obtained for all three sequences. This result suggests that the choice was affected by the grouping structure (listeners prefer the well-formed triplets), but the degree of statistical learning that occurred while listening to the sequences was the same in all conditions. The listeners thus seem to be able to learn the grammar constructed by the timbre-sequencing rule, whether the timbre sequences of the grammar are composed of similar or dissimilar timbres. Nevertheless, listeners prefer an organization in motifs composed of timbres that are close in timbre space and distant in timbre from other motifs.

IV. Concluding Remarks

Musical timbre is a combination of continuous perceptual dimensions and discrete features to which listeners are differentially sensitive. The continuous dimensions often have quantifiable acoustic correlates. This perceptual structure is represented in a timbre space, a powerful psychological model that allows predictions to be made about timbre perception in situations both within and beyond those used to derive the model from dissimilarity ratings. Timbral intervals, for example, can be conceived as vectors within the space of common dimensions.


Although the modeling of the interval relations can be perturbed if the sounds have specificities, it would not be affected by differential sensitivity of individual listeners to the common dimensions, since these would expand and contract all relations in a systematic way. Timbre space also makes at least qualitative predictions about the magnitude of timbre differences that will provoke auditory stream segregation. The further apart the timbres are in the space, the greater the probability that interleaved pitch sequences played with them will form separate streams, thereby allowing independent perception and recognition of the constituent sequences. The formalization of audio descriptors to capture quantitatively the acoustic properties that give rise to many aspects of timbre perception is beginning to provide an important set of tools that benefits several domains, including the use of signal-based metadata related to timbre that can be used in automatic instrument recognition and categorization (Eronen & Klapuri, 2000; Fujinaga & MacMillan, 2000), content-based searches in very large sound and music databases (Kobayashi & Osaka, 2008), characterization of sound and music samples in standards such as MPEG (Peeters et al., 2000), and many other music information retrieval and musical machine learning applications. These descriptors, particularly the time-varying ones, are proving to be useful in computer-aided orchestration environments (Carpentier, Tardieu, Harvey, Assayag, & Saint-James, 2010; Esling, Carpentier, & Agon, 2010; Rose & Hetrick, 2007), in which the research challenge is to predict the perceptual results of instrumental combinations and sequencings to fit a goal expressed by a composer, arranger, or sound designer. Timbre can also play a role in phrase-level variations that contribute to musical expression. Measurements of timbral variation in phrasing on the clarinet demonstrate that players control spectral and temporal properties as part of their arsenal of expressive devices. Further, mimicking instrumental variations of timbre in synthesized sound sequences increases listeners' preferences compared to sequences lacking such variation (Barthet, Kronland-Martinet, & Ystad, 2007). And in the realm of computer sound synthesis, there is increasing interest in continuous control of timbral attributes to enhance musical expression (Lee & Wessel, 1992; Momeni & Wessel, 2003). Larger-scale changes in timbre can also contribute to the expression of higher-level structural functions in music. Under conditions of high blend among instruments composing a vertical sonority, timbral roughness is a major component of musical tension. However, it strongly depends on the way auditory grouping processes have parsed the incoming acoustic information into events and streams. Orchestration can play a major role, in addition to pitch and rhythmic patterns, in the structuring of musical tension and relaxation schemas that are an important component of the aesthetic response to musical form. In the realm of electroacoustic music and in some orchestral music, timbre plays a primary grammatical role. This is particularly true in cases in which orchestration is an integral part of the compositional process, what the composer John Rea calls prima facie orchestration, rather than being a level of expression that is added after the primary structuring forces of pitch and duration have been determined, what Rea calls normative orchestration.


In such cases, the structuring and sculpting of timbral changes and relations among complex auditory events provide a universe of possibilities that composers have been exploring for decades (cf. Risset, 2004), but which musicologists have only recently begun to address (Nattiez, 2007; Roy, 2003) and psychologists have yet to tackle with any scope or in any depth. Nattiez (2007) in particular has taken Meyer's (1989) distinction between primary and secondary musical parameters and questioned his relegating of timbre to secondary status. In Meyer's conception, primary parameters such as pitch and duration3 are able to carry syntax. Syntactic relations for Meyer are based on expectations that are resolved in closure, that is, on implications and realizations. Secondary parameters, on the other hand, are not organized in discrete units or clearly recognizable categories. According to Snyder (2000), we hear secondary parameters (among which he also includes timbre) simply in terms of their relative amounts, which are useful more for musical expression and nuance than for building grammatical structures. However, Nattiez (2007) notes that, according to his own analyses of instrumental music and those of Roy (2003) in electroacoustic music, timbre can be used to create syntactic relations that depend on expectations leading to a perception of closure. As such, the main limit of Meyer's conclusion concerning timbre was that he confined his analyses to works composed in terms of pitch and rhythm and in which timbre was in effect allowed to play only a secondary functional role. This recalls Rea's distinction between prima facie and normative orchestration mentioned previously. It suffices to cite the music of electroacoustic composers such as Dennis Smalley, orchestral music by György Ligeti, or mixed music by Trevor Wishart to understand the possibilities. But even in the orchestral music of Beethoven in the high Classical period, timbre plays a structuring role at the level of sectional segmentation induced by changes in instrumentation and at the level of distinguishing individual voices or orchestral layers composed of similar timbres. As a factor responsible for structuring tension and release, timbre has been used effectively by electroacoustic composers such as Francis Dhomont and Jean-Claude Risset. According to Roy's (2003) analyses, Dhomont's music, for example, uses timbre to build expectancies and deceptions in a musical context that isn't contaminated by strong pitch structures. Underlying this last remark is the implication that in a context in which pitch is a structuring force, timbre may have a hard time imposing itself as a dominant parameter, suggesting a sort of dominance hierarchy favoring rhythm and pitch when several parameters are brought into play. Research on the conditions in which the different musical parameters can act in the presence of others in the perceptual structuring of music is not plentiful and rarely goes beyond the royal couple of pitch and rhythm (see the discussion in McAdams, 1989).4 The terrain for exploring interactions among musical parameters, and thus situating their potential relative roles in bearing musical forms, will necessitate a joint effort involving musicological analysis and psychological experimentation, but it is potentially vast, rich, and very exciting.
3 He probably really meant interonset intervals, because note duration itself is probably a secondary parameter related to articulation.
4 One exception is work by Krumhansl and Iverson (1992) showing that in the perception of sequences, there is an asymmetry in the relation between pitch and timbre such that pitch seems to be perceived more in relative terms and timbre in absolute terms.


Acknowledgments
The preparation of this chapter was supported by the Natural Sciences and Engineering Research Council and the Social Sciences and Humanities Research Council of Canada and the Canada Research Chairs program.

References
Barthet, M., Kronland-Martinet, R., & Ystad, S. (2007). Improving musical expressiveness by time-varying brightness shaping. In R. Kronland-Martinet, S. Ystad, & K. Jensen (Eds.), Computer music modeling and retrieval: Sense of sounds (pp. 313-336). Berlin, Germany: Springer.
Bey, C., & McAdams, S. (2003). Post-recognition of interleaved melodies as an indirect measure of auditory stream formation. Journal of Experimental Psychology: Human Perception and Performance, 29, 267-279.
Bigand, E., Parncutt, R., & Lerdahl, F. (1996). Perception of musical tension in short chord sequences: The influence of harmonic function, sensory dissonance, horizontal motion, and musical training. Perception & Psychophysics, 58, 125-141.
Bigand, E., Perruchet, P., & Boyer, M. (1998). Implicit learning of an artificial grammar of musical timbres. Cahiers de Psychologie Cognitive, 17, 577-600.
Brant, H. (1971). Orchestration. In J. Vinton (Ed.), Dictionary of contemporary music (pp. 538-546). New York, NY: E. P. Dutton.
Bregman, A. S. (1990). Auditory scene analysis: The perceptual organization of sound. Cambridge, MA: MIT Press.
Bregman, A. S., & Campbell, J. (1971). Primary auditory stream segregation and perception of order in rapid sequences of tones. Journal of Experimental Psychology, 89, 244-249.
Caclin, A., McAdams, S., Smith, B. K., & Winsberg, S. (2005). Acoustic correlates of timbre space dimensions: A confirmatory study using synthetic tones. Journal of the Acoustical Society of America, 118, 471-482.
Carpentier, G., Tardieu, D., Harvey, J., Assayag, G., & Saint-James, E. (2010). Predicting timbre features of instrument sound combinations: Application to automatic orchestration. Journal of New Music Research, 39, 47-61.
Carroll, D., & Chang, J. (1970). Analysis of individual differences in multidimensional scaling via an N-way generalization of Eckart-Young decomposition. Psychometrika, 35, 283-319.
Culling, J. F., & Darwin, C. J. (1993). The role of timbre in the segregation of simultaneous voices with intersecting F0 contours. Perception & Psychophysics, 34, 303-309.
Deliège, I. (1987). Grouping conditions in listening to music: An approach to Lerdahl & Jackendoff's grouping preference rules. Music Perception, 4, 325-360.
Deliège, I. (1989). A perceptual approach to contemporary musical forms. Contemporary Music Review, 4, 213-230.
Ehresman, D., & Wessel, D. L. (1978). Perception of timbral analogies, Rapports de l'IRCAM (Vol. 13). Paris, France: IRCAM-Centre Pompidou.
Erickson, R. (1975). Sound structure in music. Berkeley, CA: University of California Press.


Eronen, A., & Klapuri, A. (2000). Musical instrument recognition using cepstral coefficients and temporal features. Proceedings of the 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing, Istanbul, 2, II753-II756.
Esling, P., Carpentier, G., & Agon, C. (2010). Dynamic musical orchestration using genetic algorithms and a spectrotemporal description of musical instruments. In C. Di Chio, et al. (Eds.), Applications of evolutionary computation, LNCS 6025 (pp. 371-380). Berlin, Germany: Springer-Verlag.
Fabiani, M., & Friberg, A. (2011). Influence of pitch, loudness, and timbre on the perception of instrument dynamics. Journal of the Acoustical Society of America, 130, EL193-EL199.
Freed, D. J. (1990). Auditory correlates of perceived mallet hardness for a set of recorded percussive events. Journal of the Acoustical Society of America, 87, 1236-1249.
Fujinaga, I., & MacMillan, K. (2000). Realtime recognition of orchestral instruments. Proceedings of the International Computer Music Conference, Berlin (pp. 141-143). San Francisco, CA: International Computer Music Association.
Giordano, B. L., & McAdams, S. (2006). Material identification of real impact sounds: Effects of size variation in steel, glass, wood and plexiglass plates. Journal of the Acoustical Society of America, 119, 1171-1181.
Giordano, B. L., & McAdams, S. (2010). Sound source mechanics and musical timbre perception: Evidence from previous studies. Music Perception, 28, 155-168.
Giordano, B. L., Rocchesso, D., & McAdams, S. (2010). Integration of acoustical information in the perception of impacted sound sources: The role of information accuracy and exploitability. Journal of Experimental Psychology: Human Perception and Performance, 36, 462-476.
Gordon, J. W. (1987). The perceptual attack time of musical tones. Journal of the Acoustical Society of America, 82, 88-105.
Gregory, A. H. (1994). Timbre and auditory streaming. Music Perception, 12, 161-174.
Grey, J. M. (1977). Multidimensional perceptual scaling of musical timbres. Journal of the Acoustical Society of America, 61, 1270-1277.
Grey, J. M., & Gordon, J. W. (1978). Perceptual effects of spectral modifications on musical timbres. Journal of the Acoustical Society of America, 63, 1493-1500.
Hajda, J. M., Kendall, R. A., Carterette, E. C., & Harshberger, M. L. (1997). Methodological issues in timbre research. In I. Deliège & J. Sloboda (Eds.), Perception and cognition of music (pp. 253-306). Hove, U.K.: Psychology Press.
Handel, S. (1995). Timbre perception and auditory object identification. In B. C. J. Moore (Ed.), Hearing (pp. 425-462). San Diego, CA: Academic Press.
Handel, S., & Erickson, M. (2001). A rule of thumb: The bandwidth for timbre invariance is one octave. Music Perception, 19, 121-126.
Handel, S., & Erickson, M. (2004). Sound source identification: The possible role of timbre transformations. Music Perception, 21, 587-610.
Hartmann, W. M., & Johnson, D. (1991). Stream segregation and peripheral channeling. Music Perception, 9, 155-184.
Helmholtz, H. L. F. von (1885). On the sensations of tone as a physiological basis for the theory of music. New York, NY: Dover. (A. J. Ellis, Trans. from the 4th German ed., 1877; republ. 1954).
Henley, N. M. (1969). A psychological study of the semantics of animal terms. Journal of Verbal Learning and Verbal Behavior, 8, 176-184.
Iverson, P. (1995). Auditory stream segregation by musical timbre: Effects of static and dynamic acoustic attributes. Journal of Experimental Psychology: Human Perception and Performance, 21, 751-763.

64

Stephen McAdams

Iverson, P., & Krumhansl, C. L. (1993). Isolating the dynamic attributes of musical timbre. Journal of the Acoustical Society of America, 94, 25952603. Kendall, R. A., & Carterette, E. C. (1991). Perceptual scaling of simultaneous wind instrument timbres. Music Perception, 8, 369404. Kendall, R. A., & Carterette, E. C. (1993). Identification and blend of timbres as a basis for orchestration. Contemporary Music Review, 9, 5167. Kendall, R. A., Carterette, E. C., & Hajda, J. M. (1999). Perceptual and acoustical features of natural and synthetic orchestral instrument tones. Music Perception, 16, 327364. Kobayashi, Y., & Osaka, N. (2008). Construction of an electronic timbre dictionary for environmental sounds by timbre symbol. Proceedings of the International Computer Music Conference, Belfast. San Francisco, CA: International Computer Music Association. risation du timbre des sons Krimphoff, J., McAdams, S., & Winsberg, S. (1994). Caracte complexes. II: Analyses acoustiques et quantification psychophysique [Characterization of the timbre of complex sounds. II: Acoustical analyses and psychophysical quantification]. Journal de Physique, 4(C5), 625628. n, & Krumhansl, C. L. (1989). Why is musical timbre so hard to understand? In S. Nielze O. Olsson (Eds.), Structure and perception of electroacoustic sound and music (pp. 4353). Amsterdam, The Netherlands: Excerpta Medica. Krumhansl, C. L., & Iverson, P. (1992). Perceptual interactions between musical pitch and timbre. Journal of Experimental Psychology: Human Perception and Performance, 18, 739751. Kruskal, J. (1964a). Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29, 127. Kruskal, J. (1964b). Nonmetric multidimensional scaling: A numerical method. Psychometrika, 29, 115129. Lakatos, S. (2000). A common perceptual space for harmonic and percussive timbres. Perception & Psychophysics, 62, 14261439. , R. (1997). The representation of auditory source charLakatos, S., McAdams, S., & Causse acteristics: Simple geometric form. Perception & Psychophysics, 59, 11801190. Lee, M., & Wessel, D. L. (1992). Connectionist models for real-time control of synthesis and compositional algorithms. Proceedings of the 1992 International Computer Music Conference, San Jose (pp. 277280). San Francisco, CA: International Computer Music Association. Lerdahl, F., & Jackendoff, R. (1983). The generative theory of tonal music. Cambridge, MA: MIT Press. Lutfi, R. (2008). Human sound source identification. In W. Yost, A. Popper, & R. Fay (Eds.), Auditory perception of sound sources (pp. 1342). New York, NY: Springer-Verlag. , A., McAdams, S., & Winsberg, S. (2003). The dependency of Marozeau, J., de Cheveigne timbre on fundamental frequency. Journal of the Acoustical Society of America, 114, 29462957. , A. (2007). The effect of fundamental frequency on the Marozeau, J., & de Cheveigne brightness dimension of timbre. Journal of the Acoustical Society of America, 121, 383387. McAdams, S. (1989). Psychological constraints on form-bearing dimensions in music. Contemporary Music Review, 4(1), 181198. McAdams, S. (1993). Recognition of sound sources and events. In S. McAdams, & E. Bigand (Eds.), Thinking in sound: The cognitive psychology of human audition (pp. 146198). Oxford, U.K.: Oxford University Press.

2. Musical Timbre Perception

65

McAdams, S., & Bregman, A. S. (1979). Hearing musical streams. Computer Music Journal, 3(4), 2643. McAdams, S., & Cunibile, J.-C. (1992). Perception of timbral analogies. Philosophical Transactions of the Royal Society, London, Series B, 336, 383389. McAdams, S., & Misdariis, N. (1999). Perceptual-based retrieval in large musical sound databases. In P. Lenca (Ed.), Proceedings of Human Centred Processes 99, Brest (pp. 445450). Brest, France: ENST Bretagne. McAdams, S., & Rodet, X. (1988). The role of FM-induced AM in dynamic spectral profile analysis. In H. Duifhuis, J. W. Horst, & H. P. Wit (Eds.), Basic issues in hearing (pp. 359369). London, England: Academic Press. McAdams, S., Chaigne, A., & Roussarie, V. (2004). The psychomechanics of simulated sound sources: Material properties of impacted bars. Journal of the Acoustical Society of America, 115, 13061320. McAdams, S., Depalle, P., & Clarke, E. (2004). Analyzing musical sound. In E. Clarke, & N. Cook (Eds.), Empirical musicology: Aims, methods, prospects (pp. 157196). New York, NY: Oxford University Press. McAdams, S., Roussarie, V., Chaigne, A., & Giordano, B. L. (2010). The psychomechanics of simulated sound sources: Material properties of impacted plates. Journal of the Acoustical Society of America, 128, 14011413. McAdams, S., Winsberg, S., Donnadieu, S., De Soete, G., & Krimphoff, J. (1995). Perceptual scaling of synthesized musical timbres: Common dimensions, specificities, and latent subject classes. Psychological Research, 58, 177192. Meyer, L. B. (1989). Style and music: Theory, history, and ideology. Philadelphia, PA: University of Pennsylvania Press. Miller, J. R., & Carterette, E. C. (1975). Perceptual space for musical structures. Journal of the Acoustical Society of America, 58, 711720. Momeni, A., & Wessel, D. L. (2003). Characterizing and controlling musical material intuitively with geometric models. In F. Thibault (Ed.), Proceedings of the 2003 Conference on New Interfaces for Music Expression, Montreal (pp. 5462). Montreal, Canada: McGill University. Moore, B. C. J., & Gockel, H. (2002). Factors influencing sequential stream segregation. Acustica United with Acta Acustica, 88, 320332. ` tre secondaire? [Is timbre a Nattiez, J. -J. (2007). Le timbre est-il un parame te Que be coise de Recherche en Musique, secondary parameter?]. Cahiers de la Socie 9(12), 1324. Opolko, F., & Wapnick, J. (2006). McGill University master samples [DVD set]. Montreal, Canada: McGill University. Paraskeva, S., & McAdams, S. (1997). Influence of timbre, presence/absence of tonal hierarchy and musical training on the perception of tension/relaxation schemas of musical phrases. Proceedings of the 1997 International Computer Music Conference, Thessaloniki (pp. 438441). San Francisco, CA: International Computer Music Association. Parncutt, R. (1989). Harmony: A psychoacoustical approach. Berlin, Germany: SpringerVerlag. ` re, C. (1995). Time-domain modeling of peripheral Patterson, R. D., Allerhand, M., & Gigue auditory processing: A modular architecture and a software platform. Journal of the Acoustical Society of America, 98, 18901894. Peeters, G., McAdams, S., & Herrera, P. (2000). Instrument sound description in the context of MPEG-7. Proceedings of the 2000 International Computer Music Conference, Berlin (pp. 166169). San Francisco, CA: International Computer Music Association.

66

Stephen McAdams

Peeters, G., Giordano, B. L., Susini, P., Misdariis, N., & McAdams, S. (2011). The Timbre Toolbox: Extracting audio descriptors from musical signals. Journal of the Acoustical Society of America, 130, 29022916. Plomp, R. (1970). Timbre as a multidimensional attribute of complex tones. In R. Plomp, & G. F. Smoorenburg (Eds.), Frequency analysis and periodicity detection in hearing (pp. 397414). Leiden, The Netherlands: Sijthoff. Plomp, R. (1976). Aspects of tone sensation: A psychophysical study. London, UK: Academic Press. Risset, J.-C. (2004). Timbre. In J.-J. Nattiez, M. Bent, R. Dalmonte, & M. Baroni (Eds.), die pour le XXIe sie `cle. Vol. 2.: Les savoirs musicaux Musiques. Une encyclope [Musics. An encyclopedia for the 21st century. Vol. 2: Musical knowledge] (pp. 134161). Paris, France: Actes Sud. Risset, J. -C., & Wessel, D. L. (1999). Exploration of timbre by analysis and synthesis. In D. Deutsch (Ed.), The psychology of music (2nd ed., pp. 113168). San Diego, CA: Academic Press. ` lorchestration contempoRose, F., & Hetrick, J. (2007). Lanalyse spectrale comme aide a te raine [Spectral analysis as an aid for contemporary orchestration]. Cahiers de la Socie be coise de Recherche en Musique, 9(12), 6368. Que lectroacoustiques: Mode `les et propositions [The Roy, S. (2003). Lanalyse des musiques e analysis of electroacoustic music: Models and proposals]. Paris, France: LHarmattan. Rumelhart, D. E., & Abrahamson, A. A. (1973). A model for analogical reasoning. Cognitive Psychology, 5, 128. Saldanha, E. L., & Corso, J. F. (1964). Timbre cues and the identification of musical instruments. Journal of the Acoustical Society of America, 36, 20212126. Sandell, G. J. (1989). Perception of concurrent timbres and implications for orchestration. Proceedings of the 1989 International Computer Music Conference, Columbus (pp. 268272). San Francisco, CA: International Computer Music Association. Sandell, G. J. (1995). Roles for spectral centroid and other factors in determining blended instrument pairings in orchestration. Music Perception, 13, 209246. Schoenberg, A. (1978). Theory of harmony. Berkeley, CA: University of California Press. (R. E. Carter, Trans. from original German edition, 1911). Singh, P. G., & Bregman, A. S. (1997). The influence of different timbre attributes on the perceptual segregation of complex-tone sequences. Journal of the Acoustical Society of America, 120, 19431952. Slawson, W. (1985). Sound color. Berkeley, CA: University of California Press. Snyder, B. (2000). Music and memory: An introduction. Cambridge, MA: MIT Press. Steele, K., & Williams, A. (2006). Is the bandwidth for timbre invariance only one octave? Music Perception, 23, 215220. Tardieu, D., & McAdams, S. (in press). Perception of dyads of impulsive and sustained instrument sounds. Music Perception. Tillmann, B., & McAdams, S. (2004). Implicit learning of musical timbre sequences: Statistical regularities confronted with acoustical (dis)similarities. Journal of Experimental Psychology: Learning, Memory, and Cognition, 30, 11311142. Traube, C., Depalle, P., & Wanderley, M. (2003). Indirect acquisition of instrumental gesture based on signal, physical and perceptual information. In F. Thibault (Ed.), Proceedings of the 2003 Conference on New Interfaces for Musical Expression, Montreal (pp. 4247). Montreal, Canada: McGill University. Vurma, A., Raju, M., & Kuuda, A. (2011). Does timbre affect pitch? Estimations by musicians and non-musicians. Psychology of Music, 39, 291306.

2. Musical Timbre Perception

67

Wessel, D. L. (1973). Psychoacoustics and music: A report from Michigan State University. PACE: Bulletin of the Computer Arts Society, 30, 12. Wessel, D. L. (1979). Timbre space as a musical control structure. Computer Music Journal, 3(2), 4552. Wessel, D. L., Bristow, D., & Settel, Z. (1987). Control of phrasing and articulation in synthesis. Proceedings of the 1987 International Computer Music Conference, Champaign/Urbana (pp. 108116). San Francisco, CA: International Computer Music Association. Winsberg, S., & Carroll, D. (1989). A quasi-nonmetric method for multidimensional scaling via an extended Euclidean model. Psychometrika, 54, 217229. Winsberg, S., & De Soete, G. (1993). A latent class approach to fitting the weighted Euclidean model, CLASCAL. Psychometrika, 58, 315330. Winsberg, S., & De Soete, G. (1997). Multidimensional scaling with constrained dimensions: CONSCAL. British Journal of Mathematical and Statistical Psychology, 50, 5572. Wright, J. K., & Bregman, A. S. (1987). Auditory stream segregation and the control of dissonance in polyphonic music. Contemporary Music Review, 2(1), 6392.

3 Perception of Singing
Johan Sundberg
Department of Speech, Music, and Hearing, KTH (Royal Institute of Technology), Stockholm, Sweden

I. Introduction

Understanding of the perception of singing may emerge from two types of investigation. One type concerns acoustic properties of singing, which are systematically varied and perceptually examined. Such investigations are rare. Another type of investigation compares acoustic characteristics of various types of voices or phonations, such as classical versus belt styles or pressed versus normal phonation. As such classifications must be based on auditory perceptual cues, these investigations are perceptually relevant. Many investigations of singing possess this type of perceptual relevance. Research on the perception of singing is not as developed as is the closely related field of the perception of speech. Therefore, an exhaustive presentation cannot be made here. Rather, a number of different investigations that are only partly related to one another are reviewed.

When we listen to a singer, we can note a number of remarkable perceptual phenomena that raise a number of different questions. For instance: How is it that we can hear the voice even when the orchestra is loud? How is it that we generally identify the singer's vowels correctly even though vowel quality in singing differs considerably from what we are used to in speech? How is it that we can identify the individual singer's sex, register, and voice timbre when the pitch of the vowel lies within a range that is common to all singers and several registers? How is it that we perceive singing as a sequence of discrete pitches, even though the fundamental frequency (F0) events do not form a pattern of discrete fundamental frequencies? These are some of the main questions that are discussed in this chapter. First, however, a brief overview of the acoustics of the singing voice is presented.

II. Voice Function

The theory of voice production, schematically illustrated in Figure 1, was formulated by Fant (1960). The voice-producing system consists of three basic



Figure 1 Schematic illustration of voice function. The voice source is the pulsating transglottal airflow, which has a spectrum with harmonic partials, the amplitudes of which decrease monotonically with rising frequency. This signal is injected into the vocal tract, which is a resonator with resonances, called formants. They produce peaks in the frequency curve of the vocal tract. Partials lying close to formants are enhanced and become stronger than other partials that lie farther away from a formant.

components: (1) the respiratory system that provides an excess pressure of air in the lungs, (2) the vocal folds that chop the air stream from the lungs into a sequence of quasi-periodic air pulses, and (3) the vocal tract that gives each sound its characteristic final spectral shape and thus its timbral identity. These three components are referred to as (1) respiration, (2) phonation, and (3) vocal tract shaping (articulation) and resonance, respectively. The larynx also provides a whisper sound source, and the vocal tract also provides articulation of consonants, but these components are not discussed here.


The chopped transglottal air stream is called the voice source. It is the raw material of all voiced sounds. It can be described as a complex tone composed of a number of harmonic partials. This implies that the frequency of the nth partial equals n times the frequency of the first partial, which is called the fundamental frequency (henceforth F0) or the first harmonic. The F0 is identical with the number of air pulses occurring in 1 second, or in other words, to the vibration frequency of the vocal folds. F0 determines the pitch that we perceive in the sense that the pitch would remain essentially the same, even if the fundamental sounded alone. The amplitudes of the voice-source partials decrease monotonically with increasing frequency. For medium vocal loudness, a given partial is 12 dB stronger than a partial located one octave higher; for softer phonation, this difference is greater. On the other hand, the slope of the voice-source spectrum is generally not dependent on which voiced sound is being produced. Spectral differences between various voiced sounds arise when the sound of the voice source is transferred through the vocal tract (i.e., from the vocal folds to the lip opening). The reason for this is that the ability of the vocal tract to transfer sound is highly dependent on the frequency of the sound being transferred. This ability is greatest at the resonance frequencies of the vocal tract. Vocal tract resonances are called formants. Those voice-source partials that lie closest to the resonance frequencies are radiated from the lip opening at greater amplitudes than other partials. Hence, the formant frequencies are manifested as peaks in the spectrum of the radiated sound. The shape of the vocal tract determines the formant frequencies, which can be varied within rather wide limits in response to changes in the positions of the articulators (i.e., lips, tongue body, tongue tip, lower jaw, velum, pharyngeal sidewalls, and larynx). Thus, the two lowest formant frequencies F1 and F2 can be changed over a range of two octaves or more, and they determine the identity of most vowels, that is, the vowel quality. The higher formant frequencies cannot be varied as much and do not contribute much to vowel quality. Rather, they signify personal voice timbre. Vowel quality is often described in a chart showing the frequencies of F1 and F2, as in Figure 2. Note that each vowel is represented by a small area rather than by a point on the chart. In other words, F1 and F2 can be varied within certain limits without changing the identity of the vowel. This reflects the fact that a given vowel normally possesses higher formant frequencies in children and in females than in males. The reason for these differences lies in differing vocal tract dimensions, as will be shown later.
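The source-filter behavior just described can be sketched numerically. The following Python fragment is a minimal illustration and is not taken from the chapter: it assumes an idealized source spectrum falling by about 12 dB per octave and three formants with illustrative frequencies, bandwidths, and a simple peak shape, and it shows that the radiated partials lying closest to a formant come out strongest.

import numpy as np

def source_spectrum_db(f0, n_partials, slope_db_per_octave=-12.0):
    """Idealized voice-source spectrum: partial k at k * f0, falling by ~12 dB/octave."""
    freqs = f0 * np.arange(1, n_partials + 1)
    levels = slope_db_per_octave * np.log2(freqs / f0)  # 0 dB reference at the fundamental
    return freqs, levels

def formant_boost_db(freqs, formants, bandwidths, peak_db=20.0):
    """Very rough resonance boost: each formant adds a peak that fades with distance."""
    boost = np.zeros_like(freqs, dtype=float)
    for fc, bw in zip(formants, bandwidths):
        boost += peak_db / (1.0 + ((freqs - fc) / bw) ** 2)
    return boost

f0 = 110.0                      # roughly an adult male speaking F0
formants = [500, 1500, 2500]    # illustrative F1-F3 for a neutral vowel
bandwidths = [80, 100, 120]

freqs, source = source_spectrum_db(f0, 40)
radiated = source + formant_boost_db(freqs, formants, bandwidths)
for f, s, r in zip(freqs[:12], source[:12], radiated[:12]):
    print(f"{f:7.1f} Hz   source {s:6.1f} dB   radiated {r:6.1f} dB")

In the printout, the partials nearest the assumed formant frequencies stand several decibels above their neighbors, which is the spectral-peak pattern the text describes.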

III. Phonation

Voice quality can be varied to a considerable extent by means of laryngeal adjustments that affect the voice source. In the present section, some aspects of these effects are described.


Figure 2 Ranges of the two lowest formant frequencies for the indicated vowels represented by their symbols in the International Phonetic Alphabet. Above, the frequency scale of the first formant is given in musical notation. [Plot axes: first formant frequency, 200–1000 Hz; second formant frequency, 500–3000 Hz.]

A. Loudness, Pitch, and Phonation Type


Vocal loudness is typically assumed to correspond to intensity in decibels (dB) sound pressure level (SPL). Near the sound source in reverberant rooms, SPL decreases with increasing distance; SPL is obviously higher at shorter than at longer distances. Hence, SPL values are meaningful only when measured at a specified distance. Sound intensity is often measured at a distance of 30 cm from the lip opening. SPL has a rather complex relation to perceived loudness. The SPL of a vowel mostly reflects the strength of a single partial, namely, the strongest partial in the spectrum (Sundberg & Gramming, 1988; Titze, 1992). Except for high-pitched vowels and vowels that are produced by a very soft voice, that partial is an overtone. This overtone is normally the one that is closest to F1, so vowels produced with the same effort may vary substantially depending on the F0 and vowel. It is also a common experience that a variation of SPL, such as occurs in varying the listening level of a recording, is not perceived as a variation of vocal loudness. Rather, such variation sounds more like a change in the distance of the microphone from the source.
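The dominance of the strongest partial follows from how decibel levels combine: intensities add, so a partial that lies 10 to 15 dB above the others essentially sets the total. A small sketch with made-up partial levels (not data from the studies cited):

import numpy as np

def overall_level_db(partial_levels_db):
    """Total level of a set of partials, obtained by summing their intensities."""
    return 10 * np.log10(np.sum(10 ** (np.asarray(partial_levels_db) / 10)))

# A vowel-like spectrum in which one partial (the one nearest F1) dominates.
partials_db = [70, 84, 66, 60, 55]
print(round(overall_level_db(partials_db), 1))  # about 84.3 dB, close to the strongest partial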


Figure 3 Sound pressure level (SPL) and mean loudness ratings of the indicated vowels produced at different degrees of vocal loudness, plotted as a function of SPL and subglottal pressure (left and right panels, respectively). Data from Ladefoged and McKinney (1963).

If SPL is not closely correlated with perceived vocal loudness, what is it that determines loudness perception for voices? As shown by Ladefoged and McKinney (1963) and illustrated in Figure 3, the average rated loudness of vowels is more closely related to the underlying subglottal pressure than to SPL. The reason would be that we vary vocal loudness by means of subglottal pressure; the higher the pressure, the greater the perceived vocal loudness. Variation of subglottal pressure also causes changes in voice-source characteristics other than the overall SPL. In the voice (as in most other music instruments), the amplitudes of the higher overtones increase at a faster rate than do the amplitudes of the lower overtones, when vocal loudness is increased. This is illustrated in Figure 4, which shows the average spectrum of a voice produced by reading text at different degrees of vocal loudness. Thus, in both speech and singing, perceived vocal loudness increases with the spectral dominance of the higher overtones. In baritone singers, a 10-dB increase in the overall intensity results in a 16-dB increase in the partials near 3 kHz (Sjölander & Sundberg, 2004). In female classically trained singers, this gain varies with F0 (for a literature review, see Collyer, Davis, Thorpe, & Callaghan, 2009). The amplitude of the voice source fundamental is another important voice characteristic. It varies depending on mode of phonation, which, in turn, is strongly influenced by glottal adduction (the force by which the vocal folds are pressed against each other). It is often specified in terms of the level difference between partials 1 and 2 of the source spectrum, and referred to as H1-H2 (Harmonic 1 minus Harmonic 2). When adduction is weak (extreme: breathy/hypofunctional phonation), the fundamental is stronger than when the adduction is firm (extreme: pressed/hyperfunctional phonation). Acoustically, H1-H2 is closely correlated with the peak-to-peak amplitude of the transglottal airflow pulses. As illustrated in Figure 5, classically trained baritone singers have an average



Figure 4 Long-term-average spectra of a male untrained voice reading the same text at different degrees of vocal loudness. Data from Nordenberg and Sundberg (2004).

H1-H2 as high as almost 25 dB for very soft phonation (low subglottal pressure), while for loudest phonation it is only 7.5 dB (Sundberg, Andersson, & Hultqvist, 1999). For a given relative subglottal pressure, the fundamental in male musical theatre singers tends to be weaker, as can be seen in the same graph (Björkner, 2008). When glottal adduction is reduced to the minimum that still produces a nonbreathy type of phonation, flow phonation results, in which both the voice-source fundamental and the higher overtones are strong. Nonsingers tend to change phonation characteristics with pitch and loudness, so that high and/or loud tones are produced with a more pressed phonation than lower tones. Classically trained singers, on the other hand, seem to avoid such automatic changes of phonation. The amplitudes of the transglottal airflow pulses are influenced by the glottal area. This means that they are dependent on vocal fold length, among other things; for a given vibration amplitude, longer vocal folds open a greater glottal area than shorter folds. Therefore, at a given pitch and for a given subglottal pressure, a singer with long vocal folds should produce tones with a larger peak-to-peak amplitude of the transglottal airflow and hence a stronger voice-source fundamental than does a singer with shorter vocal folds. As low voices have longer vocal folds than higher voices, we may expect that the amplitude of the fundamental should also be included among the characteristics of the different voice categories. This probably helps us to hear whether an individual phonates in the upper, middle, or lower part of his or her pitch range. Another important difference between voice classifications is the formant frequencies, as we will see later.
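H1-H2 can be estimated directly from a recording of a sustained vowel as the level difference between the first two harmonics. The sketch below is one straightforward FFT-based way of doing this, assuming the F0 is known; the function name and the search tolerance are my own choices, not a standard taken from the literature cited here.

import numpy as np

def h1_h2_db(signal, fs, f0):
    """Estimate H1-H2: level difference (dB) between the first and second harmonics."""
    n = len(signal)
    spectrum = np.abs(np.fft.rfft(signal * np.hanning(n)))
    freqs = np.fft.rfftfreq(n, 1.0 / fs)

    def harmonic_level(k):
        # Take the strongest bin within +/- 20% of the expected harmonic frequency.
        band = (freqs > 0.8 * k * f0) & (freqs < 1.2 * k * f0)
        return 20 * np.log10(spectrum[band].max() + 1e-12)

    return harmonic_level(1) - harmonic_level(2)

# Toy check: a synthetic "vowel" whose second harmonic is 10 dB below the first.
fs, f0 = 44100, 110.0
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * f0 * t) + 10 ** (-10 / 20) * np.sin(2 * np.pi * 2 * f0 * t)
print(round(h1_h2_db(tone, fs, f0), 1))  # ~10.0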

Figure 5 Mean H1-H2 values observed at F0 = 110 Hz for classically trained opera singers and for musical theatre singers, as a function of subglottal pressure normalized with regard to the total pressure range that the singers used for this pitch. Data from Björkner (2008).

In summary, apart from pitch, there are two main aspects of vowel sounds that can be varied rather independently: the amplitude of the fundamental, which is strongly dependent upon glottal adduction, and the amplitude of the overtones, which is controlled by subglottal pressure. In nonsingers' voices, glottal adduction is typically increased with pitch and vocal loudness. Singers appear to avoid such automatic changes in voice source that accompany changes in pitch or loudness. They need to vary voice timbre for expressive rather than for physiological reasons. Thus, they can be said to orthogonalize phonatory dimensions.

B. Register
Register, in some literature also called mechanism, is an aspect of phonation that has been the subject of considerable research, yet this terminology has remained unclear (see e.g., Henrich, 2006). It is generally agreed that a register is a series of adjacent scale tones that (a) sound equal in timbre and (b) are felt to be produced in a similar way. Further, it is generally agreed that register differences reflect differences in the mode of vibration of the vocal folds. A striking example of the register concept is the contrast between the modal and falsetto registers in the male voice. Typical of the transition from one register to another is that it is often, though not necessarily, associated with a jump in pitch. In the male voice, there are at least three registers, vocal fry, modal, and falsetto. They cover the lowest, the middle, and the top pitch ranges of the voice. The female singing voice is often assumed to contain four registers: chest, middle,


head, and whistle. They cover the lowest, the lower middle, the upper middle, and the top part of the pitch range, respectively. The vocal fry register, sometimes called pulse register, often occurs in phrase endings in conversational speech. The pitch ranges of registers overlap, as illustrated in Figure 6. It should also be mentioned that many voice experts suggest that there are only two registers in both male and female voices: heavy and light, or modal and falsetto.

Physiologically, registers are associated with characteristic voice-source properties, that is, they are produced from specific vocal fold vibration properties. In vocal fry, the folds are short, lax, and thick. The transglottal flow pulses often come in groups of two or more, or with long time intervals, such that the fundamental period is very long, corresponding to frequencies typically well below 60 Hz. In modal register, the vocal folds are still short and thick, but the flow pulses come one by one and the glottis is typically closed during about 20%–50% of the period in loud phonation. In falsetto, the vocal folds are thin, tense, and long, and typically the glottis does not close completely during the vibratory cycle. These variations result in differing acoustic characteristics that are perceptually relevant. In vocal fry, the fundamental is very weak, in modal it is much stronger, and in falsetto it is often the strongest partial in the radiated spectrum. This is largely a consequence of the duration of the closed phase combined with the amplitude of the flow pulse.

Female singers and also countertenor singers mostly need to use both their lower modal/chest register and their upper falsetto/head register, and the register transitions need to be as inconspicuous as possible. Thus, the timbral differences between these registers need to be reduced to a minimum. This goal appears to be achieved by a refined function of the pitch-regulating muscles, the cricothyroid and vocalis. The vocalis muscle is located in the vocal fold, running parallel to it, and when contracted, it strives to shorten and thicken the fold. The cricothyroid muscle has the antagonistic function, striving to stretch and thin the folds. When the vocalis muscle suddenly stops contracting, the pitch suddenly rises and the register switches from modal to falsetto (such as in yodeling), causing a marked
Figure 6 Approximate ranges of the indicated vocal registers (vocal fry, modal, chest, middle/mixed/head, falsetto, and whistle) in female and male voices (upper and lower rows), on a pitch axis running from about A2 (110 Hz) to A6 (1760 Hz).


timbral contrast. A more gradual fading out of vocalis contraction with rising pitch is probably the technique used to achieve the more gradual transition that singers need. This is actually implicitly suggested by the fact that the term mixed register is often offered for the register that female voices use in the pitch range of roughly E4 to E5. Glottal leakage, that is, flow through the glottis that is not modulated by the vibrating vocal folds, is mostly associated with falsetto phonation in untrained voices. Singers, by contrast, appear to avoid or minimize it. Thus, both countertenors singing in their falsetto register and classically trained females singing in their middle/mixed register sometimes have been found to phonate with complete glottal closure (Lindestad & Södersten, 1988).

IV. Resonance

A. Formant Frequencies at High Pitches


Most singers are required to sing at F0 values higher than those used in normal speech; the average voice F0 of male and female adults is about 110 Hz and 200 Hz, rarely exceeding about 200 Hz and 400 Hz, respectively. Thus, in speech, F1 is normally higher than F0. In singing, the highest pitches for soprano, alto, tenor, baritone, and bass correspond to F0 values of about 1050 Hz (pitch C6), 700 Hz (F5), 520 Hz (C5), 390 Hz (G4), and 350 Hz (F4), respectively. Hence, the normal value of F1 of many vowels is often much lower than the singer's F0, as can be seen in Figure 2. If the singer were to use the same articulation and formant frequencies in singing as in speech, the situation illustrated in the upper part of Figure 7 would occur. The fundamental frequency, that is, the lowest partial in the spectrum, would appear at a frequency far above that of the first formant frequency (F1). In other words, the capability of the vocal tract to transfer sound would be wasted at a frequency where there is no sound to transfer. Singers avoid this situation. The strategy is to abandon the formant frequencies of normal speech and move F1 close to F0 (Garnier, Henrich, Smith, & Wolfe, 2010; Henrich, Smith, & Wolfe, 2011; Sundberg, 1975). A commonly used method for reaching this effect seems to be to reduce the maximum constriction of the vocal tract and then to widen the jaw opening (Echternach et al., 2010; Sundberg, 2009). Both these modifications tend to raise F1 (cf. Lindblom & Sundberg, 1971). This explains why female singers, in the upper part of their pitch range, tend to change their mouth opening in a pitch-dependent manner rather than in a vowel-dependent manner, as in normal speech. The acoustic result of this strategy is illustrated in the lower part of Figure 7. The amplitude of the fundamental, and hence the overall SPL of the vowel, increases considerably. Note that this SPL gain results from a resonatory phenomenon, obtained without an increase in vocal effort. Figure 8 shows formant frequencies measured in a soprano singing various vowels at different pitches. As can be seen from the figure, the singer maintained


Figure 7 Schematic illustration of the formant strategy in high-pitched singing. In the upper case, the singer has a small jaw opening so that F0 becomes higher than F1. The result is a low amplitude of the fundamental. In the lower case, the jaw opening is widened so that F1 is raised to a frequency near F0. The result is a considerable gain in amplitude of the fundamental. Reprinted from Sundberg (1977a).

the formant frequencies of normal speech up to that pitch where F0 came close to F1. Above this pitch, F1 was raised to a frequency in the vicinity of F0. Which singers use this pitch-dependent formant strategy? The strategy has been documented in soprano singers (Johansson, Sundberg, & Wilbrand, 1985; Joliveau, Smith, & Wolfe, 2004; Sundberg, 1975) but it is adopted also in other cases, where the singer sings at an F0 higher than the normal value of F1 (Henrich et al., 2011). Consulting Figure 2 once again, we find that for bass and baritone voices, most vowels have an F1 higher than their top F0. For tenors and altos, the same applies to some vowels only, and for sopranos, to few vowels. Thus, the pitch-dependent formant strategy can be assumed to be applied by bass and baritone singers for some vowels in the top of their pitch ranges, by tenors for some vowels in the upper part of their pitch range, by altos for many vowels in the upper part of their pitch range, and by sopranos for most vowels over most of their pitch range. A study of the jaw openings of professional classically trained singers for different vowels sung at different pitches basically confirmed these assumptions for the vowels //1 and /a/, but for front vowels such as /i/ and /e/, the strategy seemed to be first to widen the tongue constriction and then to widen the jaw opening (Sundberg & Skoog, 1997). A widening of the jaw opening affects F1 in the first place, but higher formant frequencies also are affected. This is also illustrated in Figure 8; all formant frequencies change when F1 approaches the vicinity of F0.

All characters appearing within // are symbols in the International Phonetic Alphabet.
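The size of the level gain from tuning F1 to F0 can be estimated with a single second-order resonance standing in for the first formant. The numbers below (a speech-like F1 of 500 Hz, an 80-Hz bandwidth, and a sung F0 of 700 Hz) are illustrative assumptions, not measurements from the studies cited above.

import math

def resonance_gain_db(f, fc, bandwidth):
    """dB gain of a single formant, modeled as a second-order resonance, at frequency f."""
    q = fc / bandwidth
    h = 1.0 / math.sqrt((1 - (f / fc) ** 2) ** 2 + (f / (fc * q)) ** 2)
    return 20 * math.log10(h)

f0 = 700.0  # a sung F5, well above the speech value of F1 for many vowels
kept_at_speech_value = resonance_gain_db(f0, fc=500.0, bandwidth=80.0)  # F1 left near 500 Hz
tuned_to_f0          = resonance_gain_db(f0, fc=700.0, bandwidth=80.0)  # F1 raised to meet F0
print(f"fundamental gain with F1 at 500 Hz: {kept_at_speech_value:5.1f} dB")
print(f"fundamental gain with F1 at 700 Hz: {tuned_to_f0:5.1f} dB")
print(f"benefit of tuning F1 to F0:         {tuned_to_f0 - kept_at_speech_value:5.1f} dB")

With these assumptions the fundamental gains on the order of 15–20 dB, the kind of resonatory boost, obtained without extra vocal effort, that the text describes.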


Figure 8 Formant frequencies of the indicated vowels (symbols from International Phonetic Alphabet) measured in a professional soprano singing different vowels at different pitches. The lines show schematically how she changed the formant frequencies with pitch. The values represented by circled symbols were observed when the subject sustained them in a speech mode. After Sundberg (1975).

B. The Singer's Formant Cluster


Although female singers gain a good deal of sound level by tuning their F1 to the vicinity of F0, male classically trained singers have to use an entirely different resonance strategy. Just singing very loudly would not help them to make their voices heard when accompanied by a loud orchestra. The reason is that male speech has an average distribution of sound energy similar to that of our symphonic orchestras (see Figure 9). Therefore, a loud orchestra would very likely mask a male singer's voice if it had the same spectral properties as in speech. However, tenors, baritones, and basses produce spectra in which the partials falling in the frequency region of approximately 2.5–3 kHz are greatly enhanced, producing a marked peak in the spectral envelope. Figure 10 compares typical examples of the vowel /u/ produced in speech and in singing by a professional singer. This peak has generally been referred to as the singer's formant or the singing formant (see later). It has been observed in most acoustic studies of tenors, baritones, and basses (see, e.g., Bartholomew, 1934; Hollien, 1983; Rzhevkin, 1956; Seidner, Schutte, Wendler, & Rauhut, 1985; Sundberg, 1974; Winckel, 1953). It has been


Figure 9 Long-term-average spectra showing the typical distribution of sound energy in Western symphonic orchestras (dashed curve) and in normal male speech at 80 dB at 0.3 m (solid curve).


Figure 10 Spectra of a spoken and sung vowel /u/ (thin and heavy curves). The peak between 2.5 and 3 kHz is called the singer's formant cluster.

found to correlate with ratings of a perceptual quality termed resonance/ring (Ekholm, Papagiannis, & Chagnon, 1998). As will be explained later, it makes the voice more audible in the presence of a loud orchestral accompaniment. When formants approach each other in frequency, the ability of the vocal tract to transfer sound increases in the corresponding frequency region. In fact, the spectral peak of the singer's formant can be explained as the acoustic consequence of clustering F3, F4, and F5 (Sundberg, 1974). Therefore, it is hereafter referred to as the singer's formant cluster. Its amplitude depends on how closely these formants are clustered, and, of course, also on subglottal pressure, that is, vocal loudness.
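A toy cascade model makes the clustering effect concrete: if the vocal tract gain in dB is taken as the sum of the contributions of individual formant resonances, pulling F3, F4, and F5 together raises the level in the 2.5-3 kHz region by many decibels. The formant frequencies and bandwidths below are illustrative assumptions, not data from the studies cited.

import math

def formant_db(f, fc, bw):
    """dB gain of one formant, modeled as a second-order resonance (0 dB at low frequencies)."""
    q = fc / bw
    h = 1.0 / math.sqrt((1 - (f / fc) ** 2) ** 2 + (f / (fc * q)) ** 2)
    return 20 * math.log10(h)

def tract_db(f, formants, bws):
    """Cascade model: total vocal tract gain in dB is the sum of the per-formant gains."""
    return sum(formant_db(f, fc, bw) for fc, bw in zip(formants, bws))

probe = 2800.0                   # a frequency inside the singer's formant region
bws = [150, 200, 250]
spread = [2500, 3500, 4500]      # F3-F5 spaced roughly as in neutral speech (illustrative)
clustered = [2500, 2900, 3300]   # F3-F5 pulled together, as described for operatic male voices

print(f"gain at {probe:.0f} Hz with spread F3-F5:    {tract_db(probe, spread, bws):5.1f} dB")
print(f"gain at {probe:.0f} Hz with clustered F3-F5: {tract_db(probe, clustered, bws):5.1f} dB")

With these particular numbers, the clustered configuration comes out roughly 15-20 dB stronger at 2.8 kHz than the spread one.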


Formant frequencies are determined by the dimensions of the vocal tract, that is, by articulation, as mentioned. An articulatory configuration that clusters F3, F4, and F5 in such a way that a singer's formant cluster is generated involves a wide pharynx (Sundberg, 1974). Such a widening can probably be achieved by a lowering of the larynx, and a low larynx position is typically observed in male classically trained singers (Shipp & Izdebski, 1975). Thus, the singer's formant cluster can be understood both acoustically and articulatorily.

The center frequency of the singer's formant cluster varies slightly between voice classifications. This was demonstrated in terms of long-term-average spectra (LTAS) by Dmitriev and Kiselev (1979). For basses, the center frequency was found to lie near 2.3 kHz, and for tenors, near 2.8 kHz. These findings were later corroborated by several researchers (Bloothooft & Plomp, 1988; Ekholm et al., 1998; Sundberg, 2001). The variation is small but has been found to be perceptually relevant in a listening test involving experts who were asked to determine the classification of synthesized stimuli (Berndtsson & Sundberg, 1995).

It seems essential that the intensity of the singer's formant cluster should not vary too much from one vowel to the other. In neutral speech, the level of F3 typically may differ by almost 30 dB between an /i/ and an /u/ because of the great difference in F2, resulting in a great difference in the proximity between F2 and F3 in these vowels (see Figure 2). Male classically trained singers densely cluster F3, F4, and F5 in /u/, while their F2 in /i/ is much lower than in speech. As a consequence, the level of the singer's formant cluster in /i/ is much more similar to that of a /u/ in singing than in speech (Sundberg, 1990). One might regard the singer's formant cluster as something like a timbral uniform cap for sung vowels that should increase the similarity in voice quality of the vowels. This would help singers to achieve a legato effect in phrases containing different vowels.

A singer's formant cluster has not been found in sopranos (Seidner et al., 1985; Weiss, Brown, & Morris, 2001). There may be several reasons for this. One may be purely perceptual. The basic principle for producing a singer's formant cluster is that F3, F4, and F5 are concentrated in a rather narrow frequency range. In high-pitched singing, the frequency distance between partials is obviously large (i.e., equal to F0). A soprano who clustered these higher formants would then produce vowels with a singer's formant cluster only at pitches where a partial happened to fall into the frequency range of the cluster. For some tones, there would be no such partial, and such tones would sound different from those where there was a partial hitting the cluster. As mentioned, large differences in voice quality between adjacent tones in a phrase do not seem compatible with legato singing.

Singers in the pop music genres do not sing with a singer's formant cluster. Rather, some of them have been observed to produce a considerably lower spectral peak in the frequency range 3.2–3.6 kHz (Cleveland, Sundberg, & Stone, 2001). Such a peak has also been observed in some professional speakers, such as radio announcers and actors, and in what has been referred to as good voices (Leino, Laukkanen, & Leino, 2011). This peak seems to result from a clustering of F4 and F5, combined with a voice-source spectrum that produces harmonic partials in this frequency range.


The singer's formant cluster is readily recognized among voice experts. However, many terms are used for it. Vennard, an eminent singing teacher and voice researcher, simply refers to it as the 2800 Hz that produces the ring of the voice (Vennard, 1967). It seems that the German term Stimmsitz, when used to refer to male, classically trained voices, is associated with a singer's formant cluster that is present in all vowels and at all pitches (W. Seidner, personal communication, 2011).

C. Modification of Vowel Quality


The deviations from the formant frequencies typical of normal speech that are produced by classically trained female singers at high pitches are quite substantial and imply considerable modification of vowel quality. However, the production of the singers formant cluster also is associated with modifications of the vowel qualities that are typical of normal speech. The reason is that the required widening of the pharynx and the lowering of the larynx also affect F1 and F2. Sundberg (1970) measured formant frequencies in vowels sung by four singers and compared these frequencies with formant frequencies reported for nonsingers by Fant (1973). As shown in Figure 11, there are considerable differences between the two. For instance, F2 in front vowels such as /i/ and /e/ does not reach as high a frequency in singing as in speech. As a result, some vowels in singing assume formant frequencies that are typical of a different vowel in speech. For example, F2 of a sung /i/ is almost the same as F2 of a spoken /y/. The differences in quality between spoken and sung vowels are well known to singers and teachers of singing. Thus, students of singing are often advised to modify or color an /i:/ toward a /y:/, an /e:/ toward an //, an /a:/ toward an /:/ etc. (see e.g., Appelman, 1967). A common strategy for male voices is to cover the vowels in the upper part of the male range or to use formant tuning (Doscher, 1994; Miller, 2008). This appears to imply that F1 is lowered in vowels that normally have a high F1, such as /a/ and /ae/. Yet, it is considered important that perceptually singers should not replace, but just modify a vowel toward another vowel. This would mean that the sung vowels should retain their perceptual vowel identity, although F1 and F2 are somewhat unusual. Compared with singers who are classically trained, singers in pop music genres appear on the whole to produce much smaller deviations from the vowel qualities used in normal speech. However, departures have been observed in some nonclassical genres also. For example, in a single subject study of the vocal style referred to as twang (a voice timbre produced by particularly strong high partials), F2 was observed to be on average about 10% higher, and F3 about 10% lower, than in a n, 2010). neutral mode of singing (Sundberg & Thale

D. Voice Classification
Singing voices are classified into six main groups: soprano, mezzo-soprano, alto, tenor, baritone, and bass. There are also commonly used subgroups, such as



Figure 11 Average formant frequencies in the indicated vowels as produced by nonsingers (dashed curves, according to Fant, 1973) and by four bass/baritone singers (solid curves, according to Sundberg, 1970). Note that the nonsingers' F4 is slightly higher than the singers' F5. From Sundberg (1974).

dramatic as opposed to lyric, or spinto, coloratura, soubrette, and so on. The main criterion for this classification is the singer's comfortable pitch range. If a singer's range is C3 to C5 (131–523 Hz), his classification is tenor. These ranges overlap to some extent, and the range C4 to E4 (262–330 Hz) is actually common to all voice classifications. Nevertheless, even if we hear a voice singing in this narrow pitch range, we can generally hear whether it is a male or a female voice, and experts can mostly even identify the voice classification. Cleveland (1977) studied the acoustic background of this classification ability with regard to male singers. He presented five vowels sung by eight professional singers (basses, baritones, or tenors) to singing teachers who were asked to classify the voices. The results revealed that the major acoustic cue in voice classification was F0. This is not very surprising, if we assume that the listeners relied on the most apparent acoustic characteristic in the first place. By comparing vowels sung at the same pitches, however, Cleveland found that the formant frequencies served as a secondary cue. The trend was that the lower the formant frequencies, the lower the pitch range the singer was assumed to possess. In other words, low


formant frequencies seemed to be associated with bass singers and high formant frequencies with tenors. In a subsequent listening test, Cleveland verified these results by presenting the same singing teachers with vowels synthesized with formant frequencies that were varied systematically in accordance with his results from the test that used real vowel sounds.

Roers, Mürbe, and Sundberg (2009) analyzed x-ray profiles of 132 singers who were accepted for solo singer education at the Hochschule für Musik in Dresden, Germany, and measured their vocal tract dimensions and vocal fold lengths. Their findings corroborated those earlier reported by Dmitriev and Kiselev (1979) that low voices tend to have long vocal tracts and vice versa. They further observed that this difference depends mainly on the length of the pharynx cavity. Thus, sopranos tend to have the shortest pharynges and basses the longest. They also noted that the vocal folds typically were shorter in classifications with a higher pitch range and longer in classifications with a lower pitch range. This suggests that at a given pitch, singers with a higher pitch range should tend to have a weaker voice-source fundamental than singers with a lower pitch range, as mentioned before.

In summary, the formant frequencies, including the center frequency of the singer's formant cluster, differ significantly between the main voice classifications. These differences probably reflect differences in vocal tract dimensions, particularly the pharynx-to-mouth length ratios.
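The pitch-range figures quoted in this section (C3 at about 131 Hz, C5 at about 523 Hz, and so on) follow from equal temperament with A4 = 440 Hz. A small helper for checking such values (my own convenience function, not part of any cited study):

def note_to_hz(name, a4=440.0):
    """Equal-tempered frequency of a note such as 'C4' or 'F#5', with A4 = 440 Hz."""
    semitone = {'C': -9, 'C#': -8, 'D': -7, 'D#': -6, 'E': -5, 'F': -4,
                'F#': -3, 'G': -2, 'G#': -1, 'A': 0, 'A#': 1, 'B': 2}
    pitch, octave = name[:-1], int(name[-1])
    return a4 * 2 ** ((semitone[pitch] + 12 * (octave - 4)) / 12)

# The tenor range quoted above, the range common to all classifications, and the soprano top C.
for note in ['C3', 'C4', 'E4', 'C5', 'C6']:
    print(f"{note}: {note_to_hz(note):6.1f} Hz")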

V. Intensity and Masking

Opera and concert singers performing in the classical style are sometimes accompanied by an orchestra that may be quite loud; the ambient sound level in a concert hall may reach 90 to 100 dB. The masking effect of a sound is strongly dependent upon how the sound energy is distributed along the frequency scale. This distribution can be visualized in terms of an LTAS. The spectrum shown in Figure 9 was obtained from a recording of the Vorspiel of the first act of Wagner's opera Die Meistersinger, and most orchestral music in Western culture produces a similar LTAS. The strongest spectral components generally appear in the region of 200–500 Hz, and above 500 Hz, the curve falls off by about 9 dB/octave, depending on how loudly the orchestra is playing (Sundberg, 1972). The masking effect of a sound with an LTAS like the one shown in Figure 9 is of course largest at those frequencies where the masking sound is loudest. It decreases more steeply below than above the masker's frequency. Thus, on average, the masking effect of the sound of the orchestra will be greatest at 200–500 Hz and less for higher and particularly for lower frequencies. The other curve in Figure 9 shows an LTAS averaged across 15 untrained male voices reading a standard text with a loud voice. This LTAS is strikingly similar to that of an orchestra, thus suggesting that the combination of a loud orchestra with


a human speaking voice would be quite unfortunate; the orchestra would mask the voice. And, conversely, if the sound of the voice were much stronger (which is very unlikely), the orchestra would then be masked. Therefore, the acoustic characteristics of the human voice as observed in loud male speech are not useful for solo singers accompanied by a loud orchestra.

Let us now return to the case of high-pitched singing. In this case, the spectrum will be dominated by the fundamental if F1 is tuned to a frequency near F0, as mentioned earlier. This can be expected to occur as soon as F0 is higher than the normal value of F1, which varies between about 300 and 800 Hz, depending on the vowel, as was illustrated in Figure 2. From what was described earlier about masking, we see that all vowels are likely to be masked by a loud orchestra provided that their F0 is below 500 Hz (below around B4). However, the vowels /, a, /, which have a first formant well above 500 Hz, will have their strongest partial above 500 Hz, so they should be less vulnerable to masking. In summary, a female singer's voice can be expected to be masked by a strong orchestral accompaniment if the pitch is below B4 and the vowel is not /a, a, /. This seems to agree with the general experience of female voices in opera singing. They are generally not difficult to hear when they sing at high pitches, even when the orchestral accompaniment is loud.

As discussed earlier, male classically trained singers produce a singer's formant cluster, consisting of a high spectral peak somewhere between 2000 and 3000 Hz. In that frequency range, the sound of an orchestra tends to be about 20 dB weaker than the partials near 500 Hz, as can be seen in Figure 9. As a consequence, the singer's formant cluster is very likely to cut through the sound of the orchestra. The effect should be particularly strong if the singer faces the audience; while low-frequency components scatter spherically from the lip opening, high-frequency components are radiated more sagittally, along the continuation of the length axis of the mouth (Cabrera, Davis, & Connolly, 2011; Marshal & Meyer, 1985). Low-frequency components are likely to be absorbed in the backstage area. Spectral partials in the singer's formant cluster, by contrast, are lost to a lesser extent as their radiation is more limited to the sagittal direction. Hence, provided that the singer is facing the audience, the partials in the singer's formant cluster will be stronger than the lower partials in the sound reaching the audience.

Two exceptions to the principle that sounds masked by a competing sound are inaudible might be mentioned. One exception is when the softer sound begins some fraction of a second earlier than the masking sound (cf. Rasch, 1978; Palmer, 1989). The other exception applies to the situation when the masking sound is time varying. Plomp (1977) demonstrated that we can hear an otherwise intermittently masked sound as continuous if the masking signal is interrupted regularly (see also Chapter 6, this volume, on effects of onset asynchrony and auditory continuity). Both these cases might apply to the singer-orchestra combination. A singer may avoid masking by starting the tones earlier than the orchestra does. Further, an orchestral accompaniment, of course, varies in intensity, which may help the singer's voice to be heard.
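The roughly 20-dB advantage of the singer's formant region over the orchestral level near 500 Hz follows directly from the LTAS slope quoted earlier. A back-of-the-envelope sketch of such an idealized orchestral LTAS (flat up to a 500-Hz corner, then falling 9 dB per octave; both numbers are simplifications of the measured curves):

import math

def orchestra_ltas_db(freq_hz, corner_hz=500.0, slope_db_per_octave=-9.0):
    """Idealized orchestral LTAS: roughly flat up to corner_hz, then falling ~9 dB/octave."""
    octaves_above_corner = max(math.log2(freq_hz / corner_hz), 0.0)
    return slope_db_per_octave * octaves_above_corner  # 0 dB at and below the corner

for f in [250, 500, 1000, 2000, 2800]:
    print(f"{f:5d} Hz: {orchestra_ltas_db(f):6.1f} dB relative to the level at 500 Hz")

At 2800 Hz this simple model gives a level about 22 dB below the 500-Hz region, in line with the approximately 20 dB quoted above.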


VI. Aspects of Voice Timbre

A. Placement
Many singers and singing teachers speak about placement and the need to project or focus the voice in order for the voice to reach the far end of a large audience. Projection was studied by Cabrera and associates (2011), who found reasons to conclude that the sound radiated from a singer can be made to change depending on the singer's intention to project. Placement can be forward, which is generally considered desirable, and backward, which is considered undesirable. Vurma and Ross (2002) studied the acoustical correlates of forward and backward projection. They first ran a listening test in which expert subjects were asked to determine whether a triad sung by different singers on different vowels was placed forward or backward. They then measured spectral characteristics of triads classified as placed forward and placed backward and observed that F2 and F3 tended to be higher in the triads that were perceived as placed forward. They also noted that the singer's formant cluster was more prominent in such triads. The term placement may be related to the fact that F3 tends to drop if the tongue tip is retracted. The increase in level of the singer's formant cluster may be the result of the increase in F2 and F3; a halving of the frequency separation between two formants will automatically increase their levels by 6 dB (Fant, 1960). Gibian (1972) synthesized vowels in which he varied F4 while keeping the remaining formants constant. An expert on singing found that the placement in the head of the tone was most forward when F4 was 2.7 kHz, which was only 0.2 kHz above F3.

B. Text Intelligibility
We have seen that female singers gain considerably in sound level by abandoning the formant frequencies typical of normal speech when they sing at high pitches. At the same time, F1 and F2 are decisive to vowel quality. This leads to the question of how it is possible to understand the lyrics of a song when it is performed with the wrong F1 and F2 values. Both vowel intelligibility and syllable/text intelligibility can be expected to be disturbed. This aspect of singing has been studied in several investigations. As a thought-provoking reminder of the difficulties in arranging well-controlled experimental conditions in the past, an experiment carried out by the German phonetician Carl Stumpf (1926) may be mentioned. He used three singer subjects: a professional opera singer and two amateur singers. Each singer sang various vowels at different pitches, with their backs turned away from a group of listeners who tried to identify the vowels. The vowels that were sung by the professional singer were easier to identify. Also, overall, the percentages of correct identifications dropped as low as 50% for several vowels sung at the pitch of G5 (784 Hz).


Since then, many investigations have been devoted to intelligibility of sung vowels and syllables (see, e.g., Benolken & Swanson, 1990; Gregg & Scherer, 2006; Morozov, 1965). Figure 12 gives an overview of the results in terms of the highest percentage of correct identifications observed in various investigations for the indicated vowels at the indicated pitches. The graph shows that vowel intelligibility is reasonably accurate up to about C5 and then quickly drops with pitch to about 15% correct identification at the pitch of F5. The only vowel that has been observed to be correctly identified more frequently above this pitch is /a/. Apart from pitch and register, larynx position also seems to affect vowel intelligibility (Gottfried & Chew, 1986; Scotto di Carlo & Germain, 1985). Smith and Scott (1980) strikingly demonstrated the significance of consonants preceding and following a vowel. This is illustrated in the same graph. Above the pitch of F5, syllable intelligibility is clearly better than vowel intelligibility. Thus, vowels are easier to identify when the acoustic signal contains some transitions (Andreas, 2006). Incidentally, this seems to be a perceptual universal: changing stimuli are easier to process than are quasi-stationary stimuli. The difficulties in identifying vowels and syllables sung at high pitches would result both from singers' deviations from the formant frequency patterns of normal speech and from the fact that high-pitched vowels contain few partials that are widely distributed over the frequency scale, producing a lack of spectral information.
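The sparseness argument is easy to quantify: at G5 the harmonics are spaced 784 Hz apart, so only a couple of partials fall below 2 kHz, the region where F1 and F2 normally lie. A two-line illustration:

f0 = 784.0  # G5, the pitch at which Stumpf's listeners fell to about 50% correct
harmonics = [round(k * f0) for k in range(1, 7) if k * f0 < 5000]
print(harmonics)  # [784, 1568, 2352, 3136, 3920, 4704]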

Figure 12 Highest percentage of correct vowel identifications observed at different pitches by Scotto di Carlo and Germain (1985), Sundberg (1977b), and Sundberg and Gauffin (1982). The open circles show corresponding data for syllables reported by Smith and Scott (1980).

In addition, a third effect may contribute. Depending on the type of phonation, the amplitude of the fundamental varies. At a high pitch, F1 may lie between the first and the second partial. Sundberg and Gauffin (1982) presented synthesized, sustained vowel sounds in the soprano range and asked subjects to identify the vowel. The results showed that an increased amplitude of the fundamental was generally interpreted as a drop in F1. It seems likely that our experience of listening to speech biases our identification of vowels and syllables. Children have short vocal tracts and short vocal folds, so they combine high formant frequencies with high pitches. In any event, improved similarity in vowel quality under conditions of increasing F0 can be obtained if a high F0 is combined with an increased F1 (Fahey, Diehl, & Traunmüller, 1996; Slawson, 1968). Unlike musical theatre singers, classically trained operatic voices possess a singer's formant cluster, as described earlier. This cluster enhances the higher spectral partials, which are crucial to consonant identification and hence to syllable intelligibility. Sundberg and Romedahl (2009) tested the hypothesis that male operatic voices would produce better text intelligibility than musical theatre singers in the presence of a loud masking noise. They presented test syllables in a carrier phrase sung by two professional singers of both classifications in a background of party babble noise, which had an LTAS similar to that of an orchestra. Listeners were asked to identify a test syllable that appeared in the carrier phrase. It turned out that the two singer types were almost equally successful, even though the sound of the operatic voices was much easier to discern when the background noise was loud. Thus, one could easily discern the voice but not the text. A relevant factor may be that the opera singers produced much shorter consonants than the musical theatre singers. It is likely that short consonants are harder to identify in the presence of a loud noise. If so, by extending the durations of the consonants, the musical theatre singers may gain text intelligibility that they would otherwise probably lose because of the absence of a singer's formant cluster.

C. Larynx Height
The perception of voice seems to be influenced by familiarity with one's own voice production. The finding mentioned earlier that perceived vocal loudness is more closely related to subglottal pressure than to SPL may be seen as a sign that we hear in terms of what would be needed to produce the acoustic characteristics that we perceive. Similarly, other perceptual dimensions of voice quality seem under some conditions to be physiological rather than acoustic. Vertical larynx positioning seems to be an example of this. The acoustic correlates of perceived changes in larynx height were investigated in a synthesis experiment (Sundberg & Askenfelt, 1983). The stimuli consisted of a series of ascending scales. Toward the end of the scale, acoustic signs of a raised larynx were introduced in terms of a weakened voice-source fundamental, increased formant frequencies, and decreased vibrato extent. These stimulus characteristics were selected on the basis of measurements on vowels produced with deliberately altered larynx positions. The stimuli were presented to a group of singing teachers
who were asked to decide whether or not the imagined singer was raising his larynx while singing the top notes of the scale. The results showed that the perception of a raised larynx was elicited most efficiently by an increase in the formant frequencies. However, the reduced amplitude of the fundamental also promoted the impression of a raised larynx. In addition, a reduced extent of vibrato contributed, provided that the formant frequencies and the amplitude of the fundamental already suggested a raised larynx. These results are not surprising, and they illustrate certain perception strategies. The strong dependence on formant frequencies is logical: a raised larynx necessarily induces an increase in the formant frequencies, so such an increase is a reliable sign of a raised larynx. The reduced amplitude of the fundamental, however, is also a sign of a change toward a more pressed phonation, and such a change does not necessarily accompany an elevation of the larynx. It is therefore logical that this was not a sufficient condition for evoking the perception of a raised larynx, and neither was a reduced extent of vibrato.

D. Singer's Identity
Voice timbre is determined by the spectral characteristics, which, in turn, are determined by the formant frequencies and the voice source, as was mentioned before. Because the partials of vocal sounds are harmonic, the partials are densely packed along the frequency scale as long as F0 is low. Formant frequencies vary between individuals and thus characterize a person's voice. At low pitches, it should be easy for a listener to recognize a person from the formant peaks in his or her vocal spectrum. However, if F0 is high, the partials are widely separated along the frequency continuum, and formants will be hard to detect. This sets the background for a set of studies carried out by Molly Erickson and associates (Erickson, 2003, 2009; Erickson & Perry, 2003; Erickson, Perry, & Handel, 2001). She ran listening tests in which she presented recordings of three- or six-note patterns sung by various singers. The stimuli were arranged according to an oddball strategy, such that two of the patterns were sung by the same singer and the third by a different singer. The listeners were asked to tell which pattern was sung by the different singer. Listeners often failed to identify the oddball correctly, particularly when the stimuli differed substantially in pitch, so a pitch difference was often interpreted as a sign of a different singer. The results were better for male than for female voices. Thus, it is difficult to determine who is singing by listening to just a few notes, particularly at high pitches.

E. Naturalness
Synthesis is a valuable tool in the identification of acoustical and physiological correlates of perceptual qualities of the singing voice. For example, let us assume that we have found a number of acoustic characteristics of a particular voice on the basis of a number of measurements. Then, all these characteristics can be included in a synthesis, varied systematically, and assessed in a listening test. The synthesis will sound

Figure 13 Spectra of the same vowel sounding clearly different with regard to naturalness (level in dB versus frequency in kHz). The left spectrum sounded unnatural mainly because the formant peaks have an unrealistic shape, in that the slopes of their flanks are not concave enough. To facilitate comparison, the spectral envelope of the left spectrum has been superimposed on the right spectrum. After Sundberg (1989).

exactly as the original sounds only if all perceptually relevant acoustic properties are correctly represented. In other words, synthesis provides a powerful tool in determining to what extent an acoustic description of a voice is perceptually exhaustive. In listening tests with synthesized stimuli, naturalness is essential. If the stimuli do not sound natural, the relevance of the results of a listening test is likely to be compromised. Perceived naturalness may depend on quite unexpected spectral characteristics. Figure 13 offers an example. It shows two spectra of the same vowel, one sounding natural and the other sounding unnatural. The spectra are almost identical. The difference, which is acoustically inconspicuous but perceptually important, consists of a minor detail in the shapes of the formant peaks in the spectrum. The version that sounds unnatural had formant peaks that were too blunt. It is interesting that this minute spectral property is perceptually important. Again, however, the perceptual strategy is quite logical. Such blunt spectral peaks can never be generated by a human vocal tract and can thus be regarded as a reliable criterion of unnaturalness.

VII. Vibrato

A. Physical Characteristics
Vibrato occurs in most Western opera and concert singing and often also in popular music. Generally, it develops more or less automatically during voice training (Björklund, 1961). Acoustically, it corresponds to an almost sinusoidal undulation of F0 and thus can be called frequency vibrato. It can be described in terms of two parameters: (1) the rate, that is, the number of undulations occurring per second, and (2) the extent, that is, the depth of the modulation expressed in cents (1 cent is a hundredth of a semitone). Several aspects of frequency vibrato have been studied (for an overview see Dejonkere, Hirano, & Sundberg, 1995). According to Prame (1994, 1997), the rate typically lies between 5.5 and 6.5 Hz, but tends to speed up somewhat toward the end of a long sustained tone. The extent of vibrato depends strongly on the singer and on the repertoire, but typically lies in the range of ±30 cents to ±120 cents, the mean across tones and singers being about ±70 cents. As the spectra of voiced sounds are harmonic, the frequencies of all partials vary in synchrony with the fundamental. The modulation amplitude of a partial depends on how far it is from a formant, while the formant frequencies do not seem to vary appreciably with the vibrato (Horii, 1989). Therefore, each partial varies in amplitude synchronously with the vibrato. In pop music, another type of vibrato is sometimes used. It corresponds to an undulation of loudness, rather than of F0, and can thus be referred to as amplitude vibrato. There are reasons to assume that it is generated by undulations of subglottal pressure. It sounds different from the opera singer's frequency vibrato. The physiological background of the frequency vibrato has been described by Hirano and coworkers (Hirano, Hibi, & Hagino, 1995). Electromyographic measurements on laryngeal muscles have revealed pulsations in synchrony with vibrato (Vennard, Hirano, Ohala, & Fritzell, 1970-1971). The variations in innervation that cause the pitch to undulate are most likely those occurring in the pitch-raising cricothyroid muscles (Shipp, Doherty, & Haglund, 1990). As secondary induced effects, subglottal pressure and transglottal airflow sometimes undulate in synchrony with vibrato. Such pulsations can be observed in some recordings published by Rubin, LeCover, and Vennard (1967).
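The rate and extent parameters translate directly into a formula for the instantaneous F0 of an idealized sinusoidal frequency vibrato: f(t) = F0 · 2^(E·sin(2πRt)/1200), where R is the rate in hertz and E the extent in cents. The following sketch is purely illustrative; the 440-Hz mean frequency is an invented example, while the rate and extent lie within the typical ranges just cited.

import math

def vibrato_f0(f_mean_hz, rate_hz, extent_cents, t_seconds):
    # Instantaneous F0 of an idealized sinusoidal frequency vibrato.
    # extent_cents is the peak deviation from the mean (e.g., 70 for +/-70 cents).
    cents = extent_cents * math.sin(2 * math.pi * rate_hz * t_seconds)
    return f_mean_hz * 2 ** (cents / 1200.0)  # 1200 cents = 1 octave

# A tone centered on 440 Hz with a 6-Hz vibrato of +/-70 cents, sampled at
# quarter-cycle steps over one vibrato cycle:
for i in range(5):
    t = i / (4 * 6.0)
    print(round(t, 3), "s:", round(vibrato_f0(440.0, 6.0, 70.0, t), 1), "Hz")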

B. Perceptual Aspects

1. Vowel Intelligibility


At high F0s, the spectral partials are widely spaced along the frequency continuum, and therefore it is difficult to detect where the formants are located; there may be no partial near the formants. It is not unreasonable to suspect that vibrato would facilitate vowel identification at high F0s, since the vibrato causes the partials to move in frequency and the amplitude variations that accompany the frequency variations then give some hints regarding the positions of the formants. The simple principle is that a partial grows in amplitude as it approaches a formant frequency and decreases in amplitude as it moves away from a formant frequency, as illustrated in Figure 14. Frequency vibrato is therefore accompanied by oscillations in intensity that are either in phase or in counterphase with the F0, depending on whether the strongest partial is just below or just above F1. A double intensity phasing occurs when a harmonic is close to the formant frequency and moves both

Figure 14 Illustration of the fact that in a tone sung with a frequency vibrato, the amplitude and frequency of a spectral partial vary in phase or in counterphase, depending on whether the partial is slightly lower or higher than the closest formant frequency. The hatched area represents the width of the frequency modulation, and the frequency scale is linear. From Sundberg (1995).


above and then below the formant peak during the vibrato cycle. Thus, phase relationships between the undulations in frequency and amplitude of a tone with vibrato actually inform us about the frequency locations of the formants. The question, then, is whether the ear can detect and use this information. If so, vibrato would facilitate vowel identification at high pitches. The influence of vibrato on the identification of synthesized vowels with an F0 between 300 and 1000 Hz was investigated by Sundberg (1977b). Phonetically trained subjects were asked to identify these stimuli as any of 12 Swedish long vowels. The effects that were observed were mostly small. As this result seems counterintuitive, McAdams and Rodet (1988) carried out an experiment in which tones with and without a vibrato were presented to four subjects. The tones had identical spectra when presented without vibrato but differed when presented with vibrato. Figure 15 shows the spectra and formant patterns they used to obtain this effect. The task of the subjects was to decide whether two stimuli that were presented in succession were identical or not. The subjects were able to hear the difference in the tones with vibrato but needed extensive training to hear the effect. These results suggest that vibrato normally does not facilitate vowel identification to any great extent.
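The phase relationships sketched in Figure 14 can be made concrete with a single idealized resonance standing in for a formant. The example below is illustrative only; the formant frequency, bandwidth, and partial frequencies are invented, and the point is simply that a partial lying below the formant swings in phase with the vibrato, whereas a partial lying above it swings in counterphase.

import math

def resonance_gain_db(freq_hz, formant_hz, bandwidth_hz):
    # Gain of a single two-pole resonance, standing in for one formant.
    s = 2j * math.pi * freq_hz
    pole = complex(-math.pi * bandwidth_hz, 2 * math.pi * formant_hz)
    h = (pole * pole.conjugate()) / ((s - pole) * (s - pole.conjugate()))
    return 20 * math.log10(abs(h))

formant, bw = 800.0, 80.0            # invented F1-like resonance
for mean in (700.0, 900.0):          # a partial below vs. above the formant
    lo = resonance_gain_db(mean * 2 ** (-70 / 1200), formant, bw)  # vibrato trough
    hi = resonance_gain_db(mean * 2 ** (+70 / 1200), formant, bw)  # vibrato peak
    phase = "in phase" if hi > lo else "in counterphase"
    print(f"partial near {mean:.0f} Hz: level swings {phase} with F0 "
          f"({lo:.1f} dB at the trough, {hi:.1f} dB at the peak)")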

2. Singleness in Pitch
In general, it is well-established that F0 determines pitch. In the case of tones with vibrato, however, this is not quite true. Although F0 varies regularly in such tones, the pitch we perceive is perfectly constant as long as the rate and extent of vibrato are kept within certain limits. What are these limits? Ramsdell studied this question at Harvard University in a thesis that unfortunately was never published. Ramsdell varied the rate and extent of vibrato systematically and had listeners decide when the resulting tone possessed an optimal singleness in pitch. His results for a 500-Hz tone are shown in Figure 16.

Figure 15 Stimulus spectra (level in dB versus frequency in kHz) and the implicated formant patterns used by McAdams and Rodet (1988) in an experiment testing the relevance of vibrato to vowel identification; the same spectrum could be obtained with the two different formant frequency patterns shown by the dashed curves.

Figure 16 Vibrato extent values producing optimal singleness in pitch at different vibrato rates (according to Ramsdell, see text). The circled symbols show maximum perceived similarity to the human singing voice obtained by Gibian (1972). Ramsdell's data were obtained with an F0 of 500 Hz, whereas Gibian's data pertain to the F0 values indicated in the graph.

Later, Gibian (1972) studied vibrato in synthetic vowels. He varied the rate and extent of vibrato and had subjects assess the similarity of this vibrato to vibrato produced by the human voice. His results agree closely with Ramsdell's data, as can be seen in the figure. In addition to asking the listeners for the optimal singleness in pitch, Ramsdell also asked them to evaluate the richness of the timbre. His data showed that the optimum in regard to singleness in pitch as well as timbral
richness corresponded to the values of rate and extent of vibrato typically observed in singers. It is interesting that Ramsdell's curve approaches a vertical straight line in the neighborhood of seven undulations per second. This implies that the extent of vibrato is not very critical for singleness in pitch at this rate.

3. Pitch and Mean F0


Another perceptual aspect of vibrato is perceived pitch. Provided that the rate and extent of vibrato are kept within acceptable limits, what is the pitch we perceive? This question was studied independently by Shonle and Horan (1980) and Sundberg (1972, 1978b). Sundberg had musically trained subjects match the pitch of a tone with vibrato by adjusting the F0 of a subsequent vibrato-free tone. The two tones, which were synthesized sung vowels, were identical except for the vibrato. They were presented repeatedly until the adjustment was completed. The rate of the vibrato was 6.5 undulations per second, and the extent was ±30 cents. Figure 17 shows the results. The ear appears to compute the average of the undulating frequency, and perceived pitch corresponds closely to this average. Shonle and Horan used sine-wave stimuli and arrived at practically the same conclusion. However, they also showed that it is the geometric rather than the arithmetic mean that determines pitch. The difference between these two means is very small for musically acceptable vibratos. It is frequently assumed that the vibrato is useful in musical practice because it reduces the demands on accuracy of F0 (see, e.g., Stevens & Davis, 1938; Winckel, 1967). One possible interpretation of this assumption is that the pitch of a tone with vibrato is less accurately perceived than the pitch of a vibrato-free tone. Another interpretation is that the pitch interval between two successive tones is perceived less accurately when the tones have vibrato than when they are vibrato-free.
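The claim that the geometric and arithmetic means differ negligibly for a musically typical vibrato is easy to check numerically. The sketch below samples one vibrato cycle; the 440-Hz center frequency is an invented example, and the ±30-cent extent matches the stimuli described above.

import math

def mean_comparison(f_center_hz, extent_cents, samples=1000):
    # Arithmetic vs. geometric mean of the instantaneous F0 of a sinusoidal vibrato.
    freqs = [f_center_hz * 2 ** (extent_cents * math.sin(2 * math.pi * i / samples) / 1200)
             for i in range(samples)]
    arith = sum(freqs) / samples
    geo = math.exp(sum(math.log(f) for f in freqs) / samples)
    return arith, geo, 1200 * math.log2(arith / geo)

arith, geo, diff_cents = mean_comparison(440.0, 30.0)
print(f"arithmetic mean {arith:.3f} Hz, geometric mean {geo:.3f} Hz, "
      f"difference about {diff_cents:.2f} cents")
# For a +/-30-cent vibrato the two means differ by only about 0.1 cent,
# far below any plausible pitch discrimination threshold.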
Figure 17 Left panel: mean F0 of a synthesized nonvibrato vowel that musically trained subjects perceived as having the same pitch as the same vowel presented with vibrato, plotted as the deviation (in cents) from the linear mean of the vibrato tone's F0, as a function of fundamental frequency (after Sundberg, 1978b). The right panel shows the waveform, rate, and extent of the vibrato used in the experiment.

The first interpretation was tested by Sundberg (1972, 1978a). The standard deviations obtained when subjects matched the pitch of a tone with vibrato with the pitch of a vibrato-free tone were compared with those obtained from similar matchings in which both tones lacked vibrato. As can be seen in Figure 18, the differences between the standard deviations were extremely small and dropped slightly with rising F0. This implies that vibrato reduces the accuracy of pitch perception slightly for low frequencies. On the other hand, the effects are too small to explain any measurable effects in musical practice. The second interpretation was tested by van Besouw, Brereton, and Howard (2008). They presented three-tone ascending and descending arpeggios to musicians. The tuning of the middle tone, which either had or lacked vibrato, was varied and the listeners task was to decide when it was in tune and when it was out of tune. The results showed that the range of acceptable intonation of the middle tone was on average about 10 cents wider when it had vibrato than when it lacked vibrato. There is also a third possible benefit of vibrato, namely in the intonation of simultaneously sounding tones forming a consonant interval. If two complex tones with harmonic spectra sound simultaneously and constitute a perfectly tuned consonant interval, some partials of one tone will coincide with some partials of the other tone. For instance, if two tones with F0 of 200 and 300 Hz (i.e., producing a perfect fifth) sound simultaneously, every third partial of the lower tone will coincide with every second partial of the upper tone. A mistuning of the interval will cause beats.
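The beat rates implied by the last example are easy to work out: for two harmonic tones roughly a fifth apart, partials 3k of the lower tone nearly coincide with partials 2k of the upper tone, and any mistuning turns each near-coincidence into beating at the frequency difference. The sketch below is illustrative only; the 10-cent mistuning is an invented value.

def beat_rates_for_fifth(f_low_hz, mistuning_cents, n_partials=12):
    # Beat rates between nearly coinciding partials of two harmonic tones
    # forming a slightly mistuned perfect fifth.
    f_high = f_low_hz * 1.5 * 2 ** (mistuning_cents / 1200.0)
    rates = []
    for k in range(1, n_partials // 3 + 1):
        lower_partial = 3 * k * f_low_hz   # 3rd, 6th, 9th, ... partial of the lower tone
        upper_partial = 2 * k * f_high     # 2nd, 4th, 6th, ... partial of the upper tone
        rates.append((lower_partial, round(abs(upper_partial - lower_partial), 2)))
    return rates

# Lower tone at 200 Hz, as in the example above, with the fifth mistuned by 10 cents:
for freq, rate in beat_rates_for_fifth(200.0, 10.0):
    print(f"near {freq:.0f} Hz: about {rate} beats per second")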

Figure 18 Effect of vibrato on the accuracy of pitch perception as a function of F0, observed when musically trained subjects first matched the pitch of a stimulus vowel lacking vibrato with a subsequent response vowel that also lacked vibrato and then repeated the test with stimulus vowels that had vibrato. The ordinate shows the differences (in cents) in standard deviation obtained between these two conditions. Symbols refer to subjects, and the heavy curve represents the group mean. From Sundberg (1978b).

These beats would disappear if one of the tones had vibrato. Thus, if two voices sing perfectly straight (i.e., without vibrato), the demands on accuracy with respect to the F0 are higher than if they sing with vibrato. In staccato coloratura singing, tones shorter than the duration of a vibrato cycle sometimes appear. d'Alessandro and Castellengo (1991) measured the perceived pitch of such short tones. Interestingly, they found that the rising half of a vibrato cycle, when presented alone, was perceived as 15 cents higher than the mean F0, while the falling half was perceived as 11 cents below the mean. The authors concluded that the ending of such short pitch glides is more significant to pitch perception than the beginning. Our conclusions are that the pitch of a vibrato tone is practically identical to the pitch of a vibrato-free tone with an F0 equal to the geometric mean of the F0 of the tone with vibrato. Moreover, the accuracy with which the pitch of a tone with vibrato is perceived is not affected to any appreciable extent by the vibrato.

VIII. Intonation in Practice

A couple of investigations on the perceived pitch of vibrato tones were mentioned earlier. These investigations were carried out under well-controlled experimental conditions. Do the results thus obtained also apply to musical practice? A study of the accuracy of F0 in musical practice is likely to answer that question. In a review of a number of investigations, Seashore (1938/1967) included a wealth of documentation of F0 recordings of professional performances of various songs. The trend was that long notes were sung with an average F0 that coincided with the theoretically correct value. This is in agreement with the experimental findings reported previously. On the other hand, long notes often began slightly flat (about 90 cents on average) and were gradually corrected during the initial 200 msec of the tone. Moreover, a great many of the long tones were observed to change their average frequency in various ways during the course of the tone. Björklund (1961) found that such deviations were typical of professional singers as opposed to nonprofessional singers. One possible interpretation of this is that pitch is used as a means of musical expression. With regard to short tones, the relationship between F0 and pitch seems to be considerably more complicated. The case is illustrated in Figure 19, which displays the pattern of F0s during a coloratura passage as sung by a male singer. The singer judged this performance to be acceptable. The recording reveals a careful coordination of amplitude, vibrato, and F0. Each note takes one vibrato period, and most of the vibrato periods seem to approximately encircle the target frequency. According to Seashore (1938/1967), the musical ear is generous and operates in the interpretive mode when it listens to singing. On the other hand, there are certainly limits to this generosity. Also, what appears to be generosity may be sensitivity to small, deliberate, and meaningful deviations from what theoretically is correct.

Figure 19 Left, the F0 pattern of a professional singer's performance of the coloratura passage shown at the top (F0 in Hz versus time on an arbitrary scale). The horizontal dotted lines represent the frequencies midway between the center frequencies of the scale tones as calculated according to equal tempered tuning, using the mean F0 of the final C3 as the reference. Right, the thin curve shows the F0 curve resulting from superimposing a sinusoid on a ramp; the heavy curve shows the running average obtained with a window length of one sine-wave cycle.

Sundberg, Prame, and Iwarsson (1996) studied what mean F0s were accepted as being in tune and out of tune in 10 commercial recordings of a song that were presented to expert listeners on a listening tape. A chart with the score of the excerpts was given to the listeners, and they were asked to circle each note they perceived to be out of tune. F0 was averaged for each tone. These mean frequencies were then related to equal tempered tuning, using the tuning of the accompaniment as the reference. The results showed a rather large variability in the judgments. Analysis of the clear cases, that is, tones that were accepted as in tune by all experts or deemed out of tune by most listeners, revealed that for most tones accepted as in tune, the mean F0 varied within a band of about ±7 cents, whereas most tones judged as out of tune fell outside this rather narrow frequency band. Furthermore, the bands corresponding to tones that were perceived as in tune did not always agree with the F0s of equal tempered tuning. For some tones, moreover, the mean F0 that was accepted as in tune varied widely. These tones seemed to be harmonically or melodically marked. Most of the singers seemed to adhere to certain principles in their deviations from equal tempered tuning. One was to sing high tones sharp, that is, to add an F0 correction that increased with pitch. The other was to sharpen tones situated on the dominant (right) side of the circle of fifths and to flatten tones on the subdominant (left) side, with the root of the prevailing chord as the 12 o'clock reference. Thus, the deviations from the scale-tone frequencies of equal tempered tuning appeared systematic. Sundberg, Lã, and Himonides (2011) analyzed the tuning of premier baritone singers and found examples of quite large deviations from equal tempered tuning, sometimes exceeding 50 cents. In particular, the highest note in phrases with an agitated emotional character was often sharpened. The intonation of such tones was then flattened to equal tempered tuning, and a listening test was run in which musician listeners were asked to rate the expressiveness in pair-wise comparisons
of the original version and the version with manipulated tuning. There was a significant preference for the original versions. This result indicates that intonation can be used as an expressive device in singing. Such meaningful deviations from equal tempered tuning are also used as expressive means in instrumental music (Fyk, 1995; Sirker, 1973; Sundberg, Friberg, & Frydén, 1991). As mentioned earlier, vibrato-free performance of mistuned consonant intervals with simultaneously sounding tones gives rise to beats, and beats are generally avoided in most types of music. By adding vibrato, the singer escapes the beats. Consequently, the vibrato seems to offer the singer access to intonation as an expressive means.
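The equal-tempered reference used in such intonation analyses amounts to a simple computation: express a measured mean F0 in cents relative to the tuning of the accompaniment and fold the result into the ±50-cent neighborhood of the nearest scale tone. The sketch below is illustrative only and is not the analysis code used in the studies cited; the example frequencies are invented.

import math

def cents_from_equal_temperament(f_hz, reference_hz=440.0):
    # Deviation (in cents) of a frequency from the nearest equal-tempered scale
    # tone, with the scale anchored to the given reference (e.g., the accompaniment's A4).
    cents = 1200 * math.log2(f_hz / reference_hz)
    return ((cents + 50) % 100) - 50   # fold into the range -50..+50 cents

# A sung tone averaging 446 Hz against an accompaniment tuned to A4 = 440 Hz
# comes out roughly 23 cents sharp of the nearest scale tone:
print(round(cents_from_equal_temperament(446.0), 1), "cents")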

IX. Expression

Expressivity is often regarded as one of the most essential aspects of singing, and it has been analyzed in a large number of investigations (for a review, see Juslin & Laukka, 2003). The focus has mostly been on basic emotions, such as anger, fear, joy, sadness, and tenderness. Only some examples of the findings reported in this research are reviewed here. The communication of basic emotions works quite well in singing. About 60% to 80% correct identifications have been observed in forced-choice listening tests concerning moods like anger, fear, and joy (Kotlyar & Morozov, 1976; Siegwarth & Scherer, 1995). Details in the performance that carry the singer's messages regarding emotions were studied by Kotlyar and Morozov (1976). They had singers perform a set of examples so as to represent different moods. They noted important effects on tempo and overall loudness and also observed characteristic time patterns in pitch and amplitude, as well as micropauses between syllables. Siegwarth and Scherer (1995) observed that the singer's tone production is also relevant, in particular the dominance of the fundamental and the amplitudes of the high partials. Rapoport (1996) found that singers used an entire alphabet of different F0 patterns for expressive purposes. For example, some tones approach their target value with a rapid or slow ascending glide, whereas others hit their target F0 at the tone onset. In most studies of the emotional coloring of singing, an agitated versus peaceful character is a dominant dimension. Sundberg, Iwarsson, and Hagegård (1995) compared performances of a set of music excerpts that were sung without accompaniment by a professional opera singer. The singer sang the excerpts in two ways, either as in a concert or as devoid of musical expression as he could. A number of characteristics were observed that appeared to differentiate agitated from peaceful excerpts. Thus, in agitated examples, sound level changes were more rapid, vocal loudness was higher, tempo was faster, and vibrato amplitude was generally greater than in the peaceful examples, particularly in the expressive versions. In excerpts with a calm ambiance, the opposite differences were observed between the expressive and neutral versions. Thus, the singer enhanced the difference between agitated and peaceful in the concert versions.

What information is conveyed by expressivity? Phrase marking appears to be an important principle, which, however, does not seem to differentiate between expressive and neutral. Another principle seems to be to enhance differences between different tone categories such as scale tones, musical intervals, and note values; the sharpening of the peak tone in a phrase described earlier can be seen as an example of this principle. A third principle is to emphasize important tones. By singing with expression, singers thus may help the listener with three cognitive tasks: (1) to realize which tones belong together and where the structural boundaries are, (2) to enhance the differences between tone and interval categories, and (3) to emphasize the important tones. Obviously, singers use an acoustic code for adding expressivity to a performance. As pointed out by Juslin and Laukka (2003), the code is similar to that used in speech; in fact, it would be most surprising if different codes were applied in speech and singing in order to convey the same information. For example, the slowing of the tempo toward the end of musical phrases is similar to the final lengthening principle used in speech for marking the end of structural units such as sentences. Likewise, in both singing and speech, an important syllable or tone can be emphasized by lengthening its upbeat (Sundberg et al., 1995). However, the expressive code used in singing may not necessarily be simply imported from that used in speech. As charmingly pointed out by Fonagy (1967, 1976, 1983), the actual origin of all changes in vocal sounds is the shape of the vocal tract and the adjustment of the vocal fold apparatus; the voice organ simply translates movement into sound changes. Fonagy argues that the expressiveness of speech derives from a pantomimic behavior of these organs. For example, in sadness, the tongue assumes a slow, depressed type of motion that stamps its own characteristics upon the resulting sound sequences.

X. Concluding Remarks

In the present chapter, two types of facts about singing have been considered. One is the choice of acoustic characteristics of vowel sounds that singers learn to adopt and that represent typical deviations from normal speech. Three examples of such characteristics have been discussed: (1) pitch-dependent choices of formant frequencies in high-pitched singing, (2) the singer's formant cluster that typically occurs in all voiced sounds in the classically trained male singing voice, and (3) the vibrato that occurs in both male and female singing. There are good reasons to assume that these characteristics serve a specific purpose. The pitch-dependent formant frequencies as well as the singer's formant cluster are both resonatory phenomena that increase the audibility of the singer's voice in the presence of a loud orchestral accompaniment. As resonatory phenomena occur independently of vocal effort, the increase in audibility is gained at no extra cost in vocal effort; hence, a likely purpose in both these cases is vocal economy. The vibrato seems to serve the purpose of allowing the singer
a greater freedom with regard to intonation, as it eliminates beats with the sound of a vibrato-free accompaniment. Thus, in these three cases, singing differs from speech in a highly functional manner. It is tempting to speculate that these characteristics have developed as a result of evolution; the singers who developed them became successful, and hence their techniques were copied by other singers. A second kind of fact about singing discussed in this chapter is the acoustic correlates of various voice classifications that can be assumed to be based on perception. Such classifications concern not only voice type (tenor, baritone, bass, and so on) but also vocal effort (e.g., piano, mezzo piano) and register. We have seen that in most of these cases it was hard to find a common acoustic denominator, because the acoustic characteristics of the categories vary with vowel and F0. Rather, the common denominator seems to exist within the body. In the case of the male voice classification into tenor, baritone, and bass, the characteristic differences in formant frequency would be the result of morphological differences in the length of the vocal tract and the vocal folds. The same is true for vocal effort and register, because they reflect differences in the control and operation of the vocal folds. Therefore, these examples of voice classification seem to rely on the properties of the airway structures rather than on specific acoustic properties of voice sounds. This probably reveals something about the way we perceive singing voices: we appear to interpret these sounds in terms of how the voice-producing system was used to create them. With regard to artistic interpretation, it seems that this contains at least three different components. One is the differentiation of different note types, such as scale tones and note values. Another component is the marking of boundaries between structural constituents such as motifs, subphrases, and phrases. These requirements of sung performance seem to apply to both speech and music and are likely to have developed in response to the properties of the human perceptual system. The third component is the signaling of the emotional ambience of the text and the music. Also in this respect, perception of singing appears to be closely related to perception of speech. The coding of emotions in speech and singing would be similar and probably founded on a body language for the communication of emotions. If this is true, our acquaintance with human emotional behavior, and particularly with speech, serves as a reference in our decoding of the emotional information in singing.

References
Andreas, T. (2006). The influence of tonal movement and vowel quality on intelligibility in singing. Logopedics Phoniatrics Vocology, 31, 1722. Appelman, D. R. (1967). The science of vocal pedagogy. Bloomington, IN: Indiana University Press. Bartholomew, W. T. (1934). A physical definition of good voice quality in the male voice. Journal of the Acoustical Society of America, 6, 2533.

Benolken, M. S., & Swanson, C. E. (1990). The effect of pitch-related changes on the perception of sung vowels. Journal of the Acoustical Society of America, 87, 17811785. Berndtsson, G., & Sundberg, J. (1995). Perceptual significance of the center frequency of the singers formant. Scandinavian Journal of Logopedics and Phoniatrics, 20, 3541. Bjrklund, A. (1961). Analysis of soprano voices. Journal of the Acoustical Society of America, 33, 575582. rkner, E. (2008). Musical theater and opera singingwhy so different? A study of subBjo glottal pressure, voice source, and formant frequency characteristics. Journal of Voice, 22, 533540. Bloothooft, G., & Plomp, R. (1988). The timbre of sung vowels. Journal of the Acoustical Society of America, 84, 847860. Cabrera, D., Davis, D. J., & Connolly, A. (2011). Long-term horizontal vocal directivity of opera singers: effects of singing projection and acoustic environment. Journal of Voice, 25(6), e291e303. Cleveland, T. (1977). Acoustic properties of voice timbre types and their influence on voice classification. Journal of the Acoustical Society of America, 61, 16221629. Cleveland, T., Sundberg, J., & Stone, R. E. (2001). Long-term-average spectrum characteristics of country singers during speaking and singing. Journal of Voice, 15, 5460. Collyer, S., Davis, P. J., Thorpe, C. W., & Callaghan, J. (2009). F0 influences the relationship between sound pressure level and spectral balance in female classically trained singers. Journal of the Acoustical Society of America, 126, 396406. dAlessandro, C., & Castellengo, M. (1991). Etude, par la synthese, de la perception du vibrato vocal dans la transition de notes. Paper presented at the International Voice Conference in Besancon, France. Dejonkere, P. H., Hirano, M., & Sundberg, J. (Eds.) (1995). Vibrato. San Diego, CA: Singular Publishing Group. Dmitriev, L., & Kiselev, A. (1979). Relationship between the formant structure of different types of singing voices and the dimension of supraglottal cavities. Folia Phoniatrica, 31, 238241. Doscher, B. M. (1994). The functional unity of the singing voice (2nd ed.). London, England: Scarecrow Press. Echternach, M., Sundberg, J., Arndt, S., Markl, M., Schumacher, M., & Richter, B. (2010). Vocal tract in female registers: a dynamic real-time MRI study. Journal of Voice, 24, 133139. Ekholm, E., Papagiannis, G. C., & Chagnon, F. P. (1998). Relating objective measurements to expert evaluation of voice quality in western classical singing: critical perceptual parameters. Journal of Voice, 12, 182196. Erickson, M. L. (2003). Dissimilarity and the classification of female singing voices: a preliminary study. Journal of Voice, 17(2), 195206. Erickson, M. L. (2009). Can listeners hear who is singing? Part BExperienced listeners. Journal of Voice, 23, 577586. Erickson, M. L., Perry, S., & Handel, S. (2001). Discrimination functions: can they be used to classify singing voices? Journal of Voice, 15(4), 492502. Erickson, M. L., & Perry, S. R. (2003). Can listeners hear who is singing? A comparison of three-note and six-note discrimination tasks. Journal of Voice, 17(3), 353369. ller, H. (1996). Perception of back vowels: effects of Fahey, R. P., Diehl, R. L., & Traunmu varying F1-F0 bark distance. Journal of the Acoustical Society of America, 99, 23502357. Fant, G. (1960). Acoustic theory of speech production. The Hague, The Netherlands: Mouton.

Fant, G. (1973). Speech sounds and features. Cambridge, MA: MIT Press. rbare Mimik. Phonetica, 16, 2535. Fonagy, I. (1967). Ho Fonagy, I. (1976). Mimik auf glottaler Ebene. Phonetica, 8, 209219. Fonagy, I. (1983). La vive voix. Paris, France: Payot. Fyk, J. (1995). Melodic intonation, psychoacoustics and the violin. Gora, Poland: Organon. Garnier, M., Henrich, N., Smith, J., & Wolfe, J. (2010). Vocal tract adjustments in the high soprano range. Journal of the Acoustical Society of America, 127, 37713780. Gibian, G. L. (1972). Synthesis of sung vowels. Quarterly Progress Report, Massachusetts Institute of Technology, 104, 243247. Gottfried, T., & Chew, S. (1986). Intelligibility of vowels sung by a countertenor. Journal of the Acoustical Society of America, 79, 124130. Gregg, J. W., & Scherer, R. C. (2006). Vowel intelligibility in classical singing. Journal of Voice, 20, 198210. Henrich, N. (2006). Mirroring the voice from Garcia to the present day: some insights into singing voice registers. Logopedics Phoniatrics Vocology, 31, 314. Henrich, N., Smith, J., & Wolfe, J. (2011). Vocal tract resonances in singing: strategies used by sopranos, altos, tenors, and baritones. Journal of the Acoustical Society of America, 129, 10241035. Hirano, M., Hibi, S., & Hagino, S. (1995). Physiological aspects of vibrato. In P. H. Dejonkere, M. Hirano, & J. Sundberg (Eds.), Vibrato (pp. 934). San Diego, CA: Singular Publishing Group. Hollien, H. (1983). The puzzle of the singers formant. In D. M. Bless, & J. H. Abbs (Eds.), Vocal fold physiology: Contemporary research and clinical issues (pp. 368378). San Diego, CA: College-Hill. Horii, Y. (1989). Acoustic analysis of vocal vibrato: theoretical interpretation of data. Journal of Voice, 3, 3643. Johansson, C., Sundberg, J., & Wilbrand, H. (1985). X-ray study of articulation and formant frequencies in two female singers. In A. Askenfelt, S. Felicetti, E. Jansson, & J. Sundberg (Eds.), SMAC 83: Proceedings of the Stockholm International Music Acoustics Conference (Vol. 1, pp. 203218). Stockholm, Sweden: The Royal Swedish Academy of Music (Publication No. 46). Joliveau, E., Smith, J., & Wolfe, J. (2004). Vocal tract resonances in singing: the soprano voice. Journal of the Acoustical Society of America, 116, 24342439. Juslin, P. N., & Laukka, P. (2003). Communication of emotions in vocal expression and music performance: different channels, same code? Psychology Bulletin, 129, 770814. Kotlyar, G. M., & Morozov, V. P. (1976). Acoustical correlates of the emotional content of vocalized speech. Soviet Physics Acoustics, 22, 208211. Ladefoged, P., & McKinney, N. P. (1963). Loudness, sound pressure, and subglottal pressure in speech. Journal of the Acoustical Society of America, 35, 454460. Leino, T., Laukkanen, A. -M., & Leino, V. R. (2011). Formation of the actors/speakers formant: a study applying spectrum analysis and computer modeling. Journal of Voice, 25, 150158. Lindblom, B., & Sundberg, J. (1971). Acoustical consequences of lip, tongue, jaw, and larynx movements. Journal of the Acoustical Society of America, 50, 11661179. ., & So dersten, M. (1998). Laryngeal and pharyngeal behavior in counterLindestad, P. A tenor and baritone singing: a videofiberscopic study. Journal of Voice, 2, 132139. Marshal, A. H., & Meyer, J. (1985). The directivity and auditory impressions of singers. Acustica, 58, 130140.

McAdams, S., & Rodet, X. (1988). The role of FM-induced AM in dynamic spectral profile analysis. In H. Duifhuis, J. Horst, & H. Wit (Eds.), Basic issues in hearing (pp. 359369). London, England: Academic Press. Miller, D. G. (2008). Resonance in singing: Voice building through acoustic feedback. Princeton, NJ: Inside View Press. Morozov, V. P. (1965). Intelligibility in singing as a function of fundamental voice pitch. Soviet Physics Acoustics, 10, 279283. Nordenberg, M., & Sundberg, J. (2004). Effect on LTAS on vocal loudness variation. Logopedics Phoniatrics Vocology, 29, 183191. Palmer, C. (1989). Mapping musical thought to musical performance. Journal of Experimental Psychology, 15, 331346. Plomp, R. (1977, July). Continuity effects in the perception of sounds with interfering noise bursts. Paper presented at the Symposium sur la Psychoacoustique Musicale, IRCAM, Paris. Prame, E. (1994). Measurements of the vibrato rate of ten singers. Journal of the Acoustical Society of America, 94, 19791984. Prame, E. (1997). Vibrato extent and intonation in professional Western lyric singers. Journal of the Acoustical Society of America, 102, 616621. Rapoport, E. (1996). Expression code in opera and lied singing. Journal of New Music Research, 25, 109149. Rasch, R. A. (1978). The perception of simultaneous notes such as in polyphonic music. Acustica, 40, 2133. rbe, D., & Sundberg, J. (2009). Voice classification and vocal tract of singers: Roers, F., Mu a study of x-ray images and morphology. Journal of the Acoustical Society of America, 125, 503512. Rubin, H. J., Le Cover, M., & Vennard, W. (1967). Vocal intensity, subglottic pressure and airflow relationship in singers. Folia Phoniatrica, 19, 393413. Rzhevkin, S. N. (1956). Certain results of the analysis of a singers voice. Soviet Physics Acoustics, 2, 215220. Scotto di Carlo, N., & Germain, A. (1985). A perceptual study of the influence of pitch on the intelligibility of sung vowels. Phonetica, 42, 188197. Seashore, C. E. (1967). Psychology of music. New York, NY: Dover. (Original work published 1938). Seidner, W., Schutte, H., Wendler, J., & Rauhut, A. (1985). Dependence of the high singing formant on pitch and vowel in different voice types. In A. Askenfelt, S. Felicetti, E. Jansson, & J. Sundberg (Eds.), SMAC 83: Proceedings of the Stockholm International Music Acoustics Conference (Vol. 1, pp. 261268). Stockholm, Sweden: The Royal Swedish Academy of Music (Publication No. 46). Shipp, T., Doherty, T., & Haglund, S. (1990). Physiologic factors in vocal vibrato production. Journal of Voice, 4, 300304. Shipp, T., & Izdebski, C. (1975). Vocal frequency and vertical larynx positioning by singers and nonsingers. Journal of the Acoustical Society of America, 58, 11041106. Shonle, J. I., & Horan, K. E. (1980). The pitch of vibrato tones. Journal of the Acoustical Society of America, 67, 246252. Siegwarth, H., & Scherer, K. (1995). Acoustic concomitants of emotional expression in operatic singing: the case of Lucia in Ardi gli incensi. Journal of Voice, 9, 249260. henempfindung bei Sirker, U. (1973). Objektive Frequenzmessung und subjektive Tonho ngen. Swedish Journal of Musicology, 55, 4758. Musikinstrumentkla

lander, P., & Sundberg, J. (2004). Spectrum effects of subglottal pressure variation in proSjo fessional baritone singers. Journal of the Acoustical Society of America, 115, 12701273. Slawson, A. W. (1968). Vowel quality and musical timbre as functions of spectrum envelope and F0. Journal of the Acoustical Society of America, 43, 87101. Smith, L. A., & Scott, B. L. (1980). Increasing the intelligibility of sung vowels. Journal of the Acoustical Society of America, 67, 17951797. Stevens, S. S., & Davis, H. (1938). Hearing, its psychology and physiology. New York, NY: Wiley. Stumpf, C. (1926). Die Sprachlaute. Berlin, Germany: Springer-Verlag. Sundberg, J. (1970). Formant structure and articulation of spoken and sung vowels. Folia Phoniatrica, 22, 2848. Sundberg, J. (1972). Production and function of the singing formant. In H. Glahn, S. Sorenson, & P. Ryom (Eds.), Report of the 11th Congress of the International Musicological Society, II (pp. 679688). Copenhagen, Denmark: Editor Wilhelm Hansen. Sundberg, J. (1974). Articulatory interpretation of the singing formant. Journal of the Acoustical Society of America, 55, 838844. Sundberg, J. (1975). Formant technique in a professional female singer. Acustica, 32, 8996. Sundberg, J. (1977a). Singing and timbre. In Music, room, acoustics (pp. 5781). Stockholm, Sweden: Royal Swedish Academy of Music (Publication No. 17). Sundberg, J. (1977b). Vibrato and vowel identification. Archives of Acoustics, 2, 257266. Sundberg, J. (1978a). Effects of the vibrato and the singing formant on pitch. Musicologica Slovaca, 6, 5169. Sundberg, J. (1978b). Synthesis of singing. Swedish Journal of Musicology, 60(1), 107112. n, & O. Olsson (Eds.), Structure and Sundberg, J. (1989). Aspects of structure. In S. Nielse perception of electroacoustic sound and music: Proceedings of the Marcus Wallenberg Symposium in Lund, Sweden, August 1988 (pp. 3342). Amsterdam, The Netherlands: Excerpta Medica. Sundberg, J. (1990). Whats so special about singers? Journal of Voice, 4, 107119. Sundberg, J. (1995). Acoustic and physioacoustics aspects of vocal vibrato. In P. H. Dejonkere, M. Hirano, & J. Sundberg (Eds.), Vibrato (pp. 3562). San Diego, CA: Singular Publishing Group. Sundberg, J. (2001). Level and center frequency of the singers formant. Journal of Voice, 15(2), 176186. Sundberg, J. (2009). Articulatory configuration and pitch in a classically trained soprano singer. Journal of Voice, 23, 546551. Sundberg, J., Andersson, M., & Hultqvist, C. (1999). Effects of subglottal pressure variation on professional baritone singers voice sources. Journal of the Acoustical Society of America, 105(3), 19651971. Sundberg, J., & Askenfelt, A. (1983). Larynx height and voice source: a relationship? In J. Abbs, & D. Bless (Eds.), Vocal fold physiology (pp. 307316). Houston, TX: College Hill. n, L. (1991). Common secrets of musicians and listeners: Sundberg, J., Friberg, A., & Fryde An analysis-by-synthesis study of musical performance. In P. Howell, R. West, & I. Cross (Eds.), Representing musical structure (pp. 161197). London, England: Academic Press.

Sundberg, J., & Gauffin, J. (1982). Amplitude of the voice source fundamental and the m (Eds.), The repreintelligibility of super pitch vowels. In R. Carlson, & B. Granstro sentation of speech in the peripheral auditory system, proceedings of a symposium (pp. 223228). Amsterdam, The Netherlands: Elsevier Biomedical Press. Sundberg, J., & Gramming, P. (1988). Spectrum factors relevant to phonetogram measurement. Journal of the Acoustical Society of America, 83, 23522360. rd, H. (1995). A singers expression of emotions in Sundberg, J., Iwarsson, J., & Hagega sung performance. In O. Fujimura, & M. Hirano (Eds.), Vocal fold physiology: Voice quality and control (pp. 217232). San Diego, CA: Singular Publishing Group. , F. M. B., & Himonides, E. (2011, June). Is intonation expressive? Poster Sundberg, J., La presented at 40th Annual Symposium on Care of the Professional Voice, Philadelphia, PA. Sundberg, J., Prame, E., & Iwarsson, J. (1996). Replicability and accuracy of pitch patterns in professional singers. In P. J. Davis, & N. H. Fletcher (Eds.), Vocal fold physiology, controlling complexity and chaos (pp. 291306). San Diego, CA: Singular Publishing Group. Sundberg, J., & Romedahl, C. (2009). Text intelligibility and the singers formanta relationship? Journal of Voice, 23, 539545. Sundberg, J., & Skoog, J. (1997). Dependence of jaw opening on pitch and vowel in singers. Journal of Voice, 11, 301306. n, M. (2010). What is twang? Journal of Voice, 24, 654660. Sundberg, J., & Thale Titze, I. R. (1992). Acoustic interpretation of the voice range profile. Journal of Speech and Hearing Research, 35, 2134. van Besouw, R. M., Brereton, J., & Howard, D. M. (2008). Range of tuning for tones with and without vibrato. Music Perception, 26, 145155. Vennard, W. (1967). Singing, the mechanism and the technic (2nd ed.). New York, NY: Fischer. Vennard, W., Hirano, M., Ohala, J., & Fritzell, B. (19701971). A series of four electromyographic studies. The National Association of Teachers of Singing Bulletin, October 1970, 1621; December 1970, 3037; FebruaryMarch 1971, 2632; MayJune 1971, 2230 Vurma, A., & Ross, J. (2002). Where is a singers voice if it is placed forward? Journal of Voice, 16(3), 383391. Weiss, R., Brown, W. S., Jr., & Morris, J. (2001). Singers formant in sopranos: fact or fiction? Journal of Voice, 15(4), 457468. r objektive Stimmbeurteilung. Folia Winckel, F. (1953). Physikalischen Kriterien fu Phoniatrica (Separatum), 5, 232252. Winckel, F. (1967). Music, sound, and sensation: A modern exposition. New York, NY: Dover.

4 Intervals and Scales


William Forde Thompson
Department of Psychology, Macquarie University, Sydney, Australia

I. Introduction

Sounds that involve changes in pitch arise from a range of sources and provide useful information about the environment. For humans, the most salient sources of pitch change come from speech and music. Speech includes rising and falling pitch patterns that characterize vocal prosody. These patterns signal the emotional state of the speaker, provide a source of linguistic accent, and indicate whether the speaker is asking a question or making a statement. Music also involves continuous pitch changes but more often involves discrete changes from one pitch level to another, called intervals. Sequences of intervals characterize the melodies in Western and non-Western music and can carry important structural, emotional, and aesthetic meaning (Crowder, 1984; Narmour, 1983; Thompson, 2009). For both speech and music, relative changes in pitch are highly informative. Indeed, it is possible that pitch changes in these two domains are processed by overlapping mechanisms (Juslin & Laukka, 2003; Patel, 2008; Ross, Choi, & Purves, 2007; Thompson, Schellenberg, & Husain, 2004). Music has the added feature that it emphasizes a collection of discrete pitch categories, reducing the audible frequency continuum to a manageable number of perceptual elements and encouraging abrupt changes in pitch. Collections of discrete pitch categories, or scales, provide a psychological framework within which music can be perceived, organized, communicated, and remembered. This chapter examines human sensitivity to pitch relations and the musical scales that help us to organize these relations. Tuning systems, the means by which scales and pitch relations are created and maintained within a given musical tradition, are also discussed. Questions addressed in this chapter include the following: How are pitch intervals processed by the auditory system? Do certain intervals have a special perceptual status? What is the relation between intervals formed by pitches sounded sequentially and those formed by pitches sounded simultaneously? Why is most music organized around scales? Are there similarities in the scales used in different musical systems across cultures? Is there an optimal tuning system?


II. Pitch Intervals

Theories of pitch intervals in music can be traced back to the Ionian Greek philosopher Pythagoras of Samos (c. 570-495 BC). His views are ingrained in many popular discussions of music and have inspired some composers to adopt a purely mathematical approach to composition (Navia, 1990). Of course, the insights of Pythagoras reflect an outmoded explanation of musical intervals that lacked the benefit of modern advances in the study of acoustics and the auditory system (Partch, 1974). Pythagoras is credited with the discovery that the pitch of a vibrating string is directly related to its length (assuming equal tension), and with inspiring the idea that musical intervals correspond to string lengths that are related to each other by simple integer ratios, such as 2:1 (octave), 3:2 (perfect fifth), 4:3 (perfect fourth), and 5:4 (major third). When tension is held constant, the length of a string is inversely related to the frequency with which it vibrates when plucked. The greater the string length, the more slowly it sways back and forth when plucked, and the lower the frequency of the sound vibrations that propagate to the auditory system. Moreover, because pitch is related to the frequency of sound vibration on a logarithmic scale, ratios of frequencies describe the same musical intervals regardless of the absolute length of the strings. Galileo Galilei (1564-1642) and, independently, Marin Mersenne (1588-1648) showed that the frequency of vibratory motion, rather than string lengths per se, is lawfully associated with pitch. Galileo Galilei proposed that different tone combinations give rise to regular or irregular motions of the eardrum, and he surmised that dissonance occurs when the eardrum moves in an irregular manner. Mersenne outlined laws to explain how one can generate higher and higher pitches by increasing the amount of string tension and hence the frequency at which the string vibrates (as in tuning a guitar). These and other insights by Pythagoras, Galileo, and Mersenne set the stage for contemporary psychoacoustic models of music. We now know that the acoustic properties of tones are roped together with complex mechanisms of the auditory system, jointly shaping our perception and appreciation of melodic form, consonance, dissonance, and harmony (Helmholtz, 1877/1954). Pythagoras was correct in his belief that certain intervals have a special status, but this is true not because numbers constitute the true nature of all things. The special status of certain intervals emerges indirectly, reflecting a complex adaptation by the auditory system to the acoustic environment.
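Because pitch is related logarithmically to frequency, the classical string ratios translate into interval sizes of 1200·log2(ratio) cents, where 100 cents correspond to one equal-tempered semitone. The short sketch below is merely illustrative of this arithmetic.

import math

def ratio_to_cents(ratio):
    # Size of a frequency (or inverse string-length) ratio in cents; 1200 cents per octave.
    return 1200 * math.log2(ratio)

for name, num, den in [("octave", 2, 1), ("perfect fifth", 3, 2),
                       ("perfect fourth", 4, 3), ("major third", 5, 4)]:
    cents = ratio_to_cents(num / den)
    print(f"{name:14s} {num}:{den} = {cents:6.1f} cents "
          f"(nearest equal-tempered interval: {round(cents / 100) * 100} cents)")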

A. Simultaneous Intervals
Simultaneous pitch combinations are the foundation of musical harmony. Why do some pitch combinations sound better than others? Preference for consonance over dissonance is observed in infants with little postnatal exposure to culturally specific music (Trainor, Tsang, & Cheung, 2002; Hannon & Trainor, 2007). Even baby

chicks share this inclination for consonance over dissonance (Chiandetti & Vallortigara, 2011), although some nonhuman animals are less attuned to the distinction (McDermott & Hauser, 2005). On balance, it appears that sensory factors provide a soft constraint on preferences for consonance, which can be modified by learning and enculturation. Consonance and dissonance play crucial roles in music across cultures: whereas dissonance is commonly associated with musical tension, consonance is typically associated with relaxation and stability (Butler & Daston, 1968; Vassilakis, 2005). The aesthetic appeal of consonant intervals was noticed early on by the Greeks. Pythagoras intuited that pitch combinations sound consonant if the lengths of strings that produce pitches are related to each other by small integer ratios. Intrigued by this correspondence, he advocated the sweeping notion that numbers could explain the universe (Tenney, 1988). More contemporary theories hold that the perceived consonance of intervals is determined by a number of factors, including sensory and acoustic factors, musical training, personal preference, and enculturation (Cazden, 1945; Parncutt, 1989, 2006). The perceptual outcomes of sensory and acoustic factors are referred to as sensory or psychoacoustic consonance and dissonance; the effects of musical training, personal preference, and enculturation are referred to as musical or tonal consonance and dissonance (Terhardt, 1984). Since the insights of Galileo and Mersenne, the most influential breakthrough in the study of musical acoustics was made by Helmholtz (1877/1954), who observed that consonant intervals (i.e., sensory consonance) are characterized by the absence of beating. Beating is an acoustic phenomenon in which concurrent tones that are similar in frequency but not identical drift in and out of phase, such that the amplitude of the summed waveform waxes and wanes in rapid succession. This oscillation between constructive and destructive acoustic interference is termed beating and occurs at a rate determined by the difference between the two frequencies. For example, combining frequencies of 220 and 245 Hz will give rise to 25 beats per second. The presence of beats does not in itself lead to dissonance. Very slow rates of beating sound neither pleasant nor unpleasant. Helmholtz contended that dissonance is equivalent to acoustic roughness, which occurs when beats are so rapid that they begin to blend together. Roughness and dissonance emerge when the rate of beats increases to about 20–30 beats per second, which occurs when a frequency at about 400 Hz is combined with another frequency that differs by about a semitone (Plomp & Levelt, 1965). As the separation between two frequencies increases, the rate of beats increases, but beyond 20–30 beats per second the beats become less salient and the two frequencies are perceived as distinct tones. Beats disappear when constructive and destructive interference is no longer registered by the auditory system. This failure of the auditory system to register fast amplitude modulations of sound waves can be explained by considering the mechanics of the basilar membrane and how it responds to sound. Pure tones (sounds consisting of one frequency only) excite specific regions on the basilar membrane: high frequencies cause

maximal vibration of the membrane near the oval window and low frequencies cause maximal vibration of the membrane near the apex (Von Békésy, 1949). Two pure tones that are close in frequency generate overlapping responses in the basilar membrane. When this overlap has perceptual consequences, the frequencies are said to be within the same critical bandwidth (Greenwood, 1961a, 1961b). Perceptually significant overlap in the response of the basilar membrane to different frequencies leads to sensations of beating, roughness, and sensory dissonance. Roughness reaches a maximum when input frequencies are separated by about 30% to 40% of a critical bandwidth (Greenwood, 1991). The width of the critical band varies across the audible pitch range, whether measured in hertz or in semitones. For pitches below about 400 Hz, the width of a critical band varies in a manner that is roughly intermediate between a linear frequency scale (hertz) and a logarithmic frequency scale (i.e., semitones). For pitches above 400 Hz, the width varies in a manner that is close to logarithmic. As illustrated in Figure 1, sensory dissonance should be evident across a wider pitch range (measured in semitones) at low pitches than at high pitches. For example, a simultaneous major third interval should create less sensory dissonance when played on the high notes of a piano than when played on the low notes of a piano. Plomp and Levelt (1965) also noted that for pure tone combinations, sensory dissonance occurs only for small frequency separations such as a semitone, and is not evident for larger intervals. Pure tones, however, do not occur in nature. When a string is plucked, it vibrates at multiple frequencies simultaneously, giving rise to a complex waveform. This complex waveform is still perceived as a unified tone, and each frequency component is referred to as a partial (i.e., part of the tone). The slowest rate of repetition, or fundamental frequency, is supplemented with a number of higher frequencies of vibration that are roughly multiples of the fundamental frequency. That is, if the fundamental frequency of a vibrating string has the value of f cycles per second

Figure 1 Sensory dissonance arising from simultaneous sine waves, plotted as a function of the frequency interval between the two tones (in 12-tet scale steps up to the octave) for lower-tone frequencies of 100, 200, 400, 600, and 1000 Hz. In the upper pitch region, dissonance is mainly associated with small intervals. In the lower pitch region, dissonance is associated with both small and larger intervals. From Sethares (2005, p. 47).

(or hertz), then there will also tend to be vibrations of the string at one or more of the frequencies 2f, 3f, 4f, 5f, and so on, creating a complex waveform. These higher frequencies are not heard as separate pitches but are grouped together with the fundamental frequency and heard as a single coherent entity. That is, the auditory system automatically binds together frequency components that are integer multiples of a common fundamental frequency (Micheyl & Oxenham, 2010). The pitch of any complex tone corresponds to the overall repetition rate of the complex waveform. The repetition rate is usually equivalent to the fundamental frequency and unaffected by the presence of harmonic overtones. It is also the same whether or not the fundamental frequency is present, as long as a number of overtones are present. Indeed, many handheld playback devices are incapable of reproducing low frequencies and yet listeners of these devices rarely notice that fundamental frequencies are missing (McDermott & Oxenham, 2008). Although overtones are not usually heard as individual pitches, they help to give the tone its characteristic timbre or sound quality and are crucial in understanding the nature of musical intervals. Figure 2 illustrates the patterns of overtones evident in the sound spectrum for a note played on a pan flute with a fundamental frequency at f = 441 Hz and prominent overtones at frequencies of 3f, 5f, 7f, 9f, and 11f. Other instruments are associated with a different pattern of overtones. Higher frequencies that are exact multiples of the fundamental frequency, called harmonic overtones or harmonic partials, are implicated in the most familiar intervals in music. In particular, distances between harmonic partials are roughly equivalent to the most familiar musical intervals: the octave (f to 2f), fifth (2f to 3f), fourth (3f to 4f), major third (4f to 5f), minor third (5f to 6f), major second (8f to 9f), and major sixth (3f to 5f). It is tempting to surmise that the pitch relations that occur between the partials of individual tones are unconsciously internalized and expressed artistically in the form of music and other creative arts. For example, Ross et al. (2007) proposed that human preference for the most common intervals found in music arises from experience with the way speech formants modulate laryngeal harmonics to create
Figure 2 The spectrum from a pan flute with a fundamental frequency at f = 440 Hz and prominent partials at approximately 3f, 5f, 7f, 9f, and 11f (magnitude plotted against frequency, 0–6000 Hz). From Sethares (2005, p. 111).

different phonemes. Their approach was to analyze the spectra of vowels in neutral speech uttered by speakers of American English and Mandarin, and to compare the harmonics with the greatest intensity within the first and second formants. This procedure resulted in a distribution of all second formant/first formant ratios derived from the spectra of 8 vowels uttered by American English speakers and 6 vowels uttered by Mandarin speakers. On average, 68% of the frequency ratios extracted matched intervals found in the chromatic scale. In contrast, only 36% of randomly selected pairs of harmonics in the same frequency range matched intervals found in the chromatic scale. This comparison illustrates that musical intervals are not merely correlated with pitch intervals found in any harmonic (periodic) waveform, but reflect a bias that is specific to speech. This speech-specific bias suggests that "the human preference for the specific intervals of the chromatic scale, subsets of which are used worldwide to create music, arises from the routine experience of these intervals during social communication" (Ross et al., 2007, p. 9854; see also, Han, Sundararajan, Bowling, Lake, & Purves, 2011). Most researchers, however, believe that the widespread use of certain intervals in music is encouraged by basic functions of the auditory system. First, Helmholtz (1877/1954) noted that the concept of roughness can be extended to combinations of complex tones, with the total amount of dissonance equal to some combination of the roughness generated by all interacting partials. When tones with harmonic spectra are combined, consonant intervals such as the octave and fifth have many partials in common, and those that are unique tend not to occur within a critical band and hence do not give rise to roughness. Complex tones that form dissonant intervals such as the diminished fifth (six semitones) have few partials in common, and some of their unique partials fall within the same critical band, giving rise to beating and roughness. Most significantly, the third and fourth partials of the lower pitch of a tritone interval are only one semitone removed from the second and third partials of the higher pitch of that interval. Plomp and Levelt (1965) calculated predicted levels of consonance and dissonance for combinations of tones consisting of six harmonic partials and with the first tone fixed at 250 Hz (see also Hutchinson & Knopoff, 1978; Kameoka & Kuriyagawa, 1969a, 1969b; Terhardt, 1974). The results of these calculations illustrate consonance peaks at intervals commonly used in Western music: minor third (5:6), major third (4:5), perfect fourth (3:4), perfect fifth (2:3), major sixth (3:5), and octave (1:2). Kameoka and Kuriyagawa (1969a, 1969b) developed an algorithm for estimating the total amount of dissonance in dyads of pure and complex tones. Their model assumed that dissonance is additive and dependent on loudness, and they relied on the power law of psychological significance to combine dissonance levels from different dyads of harmonics, yielding a final measure referred to as absolute dissonance. These mathematical models of dissonance are broadly in agreement with judgments of dissonance, but predictions break down when more or fewer harmonics are included in the model (Mashinter, 2006; Vos, 1986). Roughness may not be the sole determinant of consonance. Carl Stumpf (1890, 1898) suggested that consonance arises from tonal fusion, the tendency for

combinations of tones to merge together. A related view is that consonance is enhanced by harmonicity, the extent to which the combined frequency components in an interval match a single harmonic series. Harmonicity is thought to play an important role in pitch perception. Terhardt (1974) proposed that the auditory system matches any incoming collection of partials, whether arising from a single tone or from combinations of tones, to the nearest harmonic template. If partials align with the harmonic series, the pitch is unambiguous. As the collection of partials deviates from harmonicity, the pitch becomes more ambiguous. According to Terhardt, harmonic templates develop through repeated exposure to the harmonic spectra of speech sounds, which predominate in the acoustic environment throughout human development. A more general possibility is that repeated exposure to any acoustic stimulus leads to the development of a template for that stimulus. Chord templates, for example, could develop even for tone combinations that do not align with a harmonic series, as long as those chords are repeatedly encountered in a person's musical environment. Such templates would allow trained musicians to identify highly familiar chords and may also underlie the perception of consonance and dissonance (McLachlan, 2011; see also, McLachlan & Wilson, 2010). For the octave interval, the partials of the higher-pitch tone coincide with the even-numbered partials of the lower-pitch tone. The result of this combination is a new complex tone with a fundamental frequency equal to that of the original lower tone, but with a different amplitude spectrum and, hence, a different timbre. This coincidence of partials explains why tones separated by an octave are perceived to be highly similar, a phenomenon known as octave equivalence (Idson & Massaro, 1978; Kallman, 1982; Shepard, 1964). The octave interval is highly consonant and generates a strong sensation of pitch equivalent to the lower tone of the interval. Less consonant intervals tend to generate more ambiguous pitch sensations. Thompson and Parncutt (1997) modeled the pitch sensations arising from the perfect fifth interval, major third interval, and major triad (see also Parncutt, 1989). Their model assumes that simultaneous intervals generate multiple pitch sensations that extend beyond the fundamental frequencies of the tones, reflecting overtones, subharmonic tone sensations, and the effects of auditory masking. These pitch sensations vary in salience depending on the interval, with the most salient pitch sensation perceived as the (virtual) pitch of the complex. Tone combinations that generate highly salient and unambiguous pitch sensations should lead to greater fusion and, according to Stumpf, greater consonance. Predictions from the model were compared with goodness-of-fit ratings of probe tones presented immediately following the intervals. Results indicated a close correspondence between predictions and ratings, confirming the basic assumptions of the model. Most researchers believe that harmonicity plays an important role in pitch perception, but the role of harmonicity in consonance is less clear. One challenge is that harmonicity is associated with the absence of beating, so any association between harmonicity and consonance can be explained by the amount of beating among partials. To disentangle these factors, McDermott, Lehr, and Oxenham (2010)

examined individual differences in preference ratings for beats and harmonicity to determine which factor correlates the most with preference for consonance. Surprisingly, their measure of beating preference did not correlate well with preference ratings for consonant and dissonant musical chords. That is, those who found beating unpleasant did not have a parallel dislike of dissonant intervals. Instead, preference for harmonicity correlated well with the preference for consonance (see also, Plack, 2010). Neuroscientific evidence is also compatible with the view that harmonicity exerts an influence on consonance, at least for isolated intervals. Bidelman and Krishnan (2009) used event-related potentials to index the perceived consonance of nine musical dyads. Each note of the dyad was a complex tone consisting of six harmonics (equal amplitude), and the stimulus intervals varied in size from 0 to 12 semitones (0, 1, 4, 5, 6, 7, 9, 11, 12). Consonance ratings of the nine intervals were also obtained by computing the number of times a given interval was selected as more pleasant sounding out of the 36 pairwise comparisons. The nine stimuli were presented dichotically in order to avoid effects of beating and other peripheral processing effects, and to isolate responses from central (brainstem) pitch mechanisms. Brainstem frequency-following responses (FFR) were then measured in response to the nine intervals. The FFR reflects phase-locked activity from a population of neural elements in the midbrain. It is characterized by a periodic waveform that follows the individual cycles of the stimulus. FFRs were analyzed based on their neural periodicity: a neural pitch salience value was calculated by comparing the neural periodicity for each interval with a period template. This pitch salience value estimates the relative strength of possible pitches present in the FFR. For example, perfectly harmonic spectra give rise to high pitch salience values. The pitch salience values closely aligned with consonance ratings of the intervals (r = 0.81), suggesting that consonance is strongly correlated with neural periodicity. Dissonant intervals appear to be characterized by less coherent neural periodicity. In a later study, Bidelman and Krishnan (2011) used event-related potentials to model the perceived consonance of four prototypical musical triads: major triad, minor triad, diminished triad, and augmented triad. Again, pitch salience values accurately predicted consonance and dissonance ratings of the stimuli. The investigators argued that harmonically related pitch periods produce a high degree of coherence in their neural representation, leading to high levels of pitch salience. Dissonant triads, in contrast, evoke less coherent neural periodicity and lower pitch salience. It should be noted, however, that triads with high pitch salience are also very common and hence familiar. Increased familiarity may lead to higher consonance ratings and more efficient processing of the periodic content (McLachlan, 2011). Partials that are harmonically related tend to become fused, but fusion is also influenced by other factors such as coincident onset and offset characteristics. When two different tones are combined to form an interval, fusion is also enhanced when the tones have partials in common. For harmonic complex tones, the effects of roughness are correlated with both periodicity and fusion, so the relative

contributions of these factors to consonance are entangled. One way to evaluate the importance of fusion independent of periodicity is to investigate the consonance of intervals that are formed by combining inharmonic tones. By manipulating the spectral components of artificial tones, one can create intervals that sound dissonant between harmonic tones but consonant between inharmonic tones. For example, consider a complex tone consisting of two inharmonic partials at frequencies f and √2f (a ratio that corresponds to a tritone, or 6 semitones in the equally tempered chromatic scale). The spectrum itself is inharmonic: for most listeners it does not give rise to a clear pitch sensation and sounds somewhat like a chime. Nonetheless, as shown in Figure 3, if this tritone chime is combined with another tritone chime at progressively divergent pitch distances, the theoretical dissonance curve will show minima at 0 semitones, 6 semitones (the tritone interval), and 12 semitones (the octave). For these interval sizes, roughness or beating among partials is minimized. Thus, the absence of roughness in itself can lead to the perception of consonance, even for spectra that are inharmonic and give rise to ambiguous pitch sensations. Among isolated intervals, sensory consonance may be enhanced by tonal fusion, harmonicity, and the absence of roughness and beating. Additional factors may contribute to subtle aspects of interval perception such as the emotional distinction between major and minor thirds (Cook, 2007). However, music rarely involves the presentation of isolated intervals, and the influence of these factors on consonance becomes more complicated when intervals are considered in a musical context. David Huron observed that J. S. Bach tended to avoid tonal fusion when he was (presumably) pursuing perceptual independence of contrapuntal voices. First, simultaneous intervals that are most likely to fuse, such as octaves, fourths, and fifths, are generally avoided (Huron, 1991a). The compositional strategy of avoiding consonant intervals does not lead to an increased risk of dissonance because listeners are encouraged to attend to horizontal structure. Second, when consonant intervals are unavoidable between different contrapuntal voices, they tend to be asynchronous (Huron, 2008). This compositional strategy is employed because it is difficult to hear out individual components of a chord in which
Figure 3 Dissonance curve for an inharmonic spectrum with partials at f and √2f, plotting sensory dissonance against interval size in semitones (0–12). Minima are evident at 1.21 (between 3 and 4 semitones) and 1.414 (a tritone). From Sethares (2005, p. 102).

components occur with synchronous onsets and offsets (Demany & Ramos, 2005). If there are too many consonant intervals with synchronous onsets, fusion might occur between tones that should be heard as part of different melodic voices, reducing their perceptual independence. As discussed by Wright and Bregman (1987), mechanisms of tonal fusion (vertical or harmonic structure) can work in opposition to mechanisms of auditory stream segregation that promote the perceptual grouping of tones over time (horizontal or melodic structure). Avoiding tonal fusion can be used to encourage the perception of horizontal (voicing) structure, and strengthening horizontal structure (for example, by restricting melodic lines to small intervals) can be used to suppress the potential dissonance that might occur between simultaneous voices in polyphonic music (for an extensive discussion, see Bregman, 1990, Chapter 5). Such effects lead to the surprising conclusion that the extent to which a given interval is perceived as dissonant depends upon how well the constituent tones are integrated into respective melodic voices (Huron, 1991b). Wright (1986) has argued that the historical increase in perceived dissonance in music reflects not so much an increased prevalence of dissonant harmonies as reduced efforts to prepare for dissonant moments by emphasizing horizontal structure. Can fusion between simultaneous intervals really be avoided by emphasizing horizontal structure? Electrophysiological evidence suggests that concurrent melodies are represented separately in two-part polyphony regardless of musical training (Fujioka, Trainor, Ross, Kakigi, & Pantev, 2005). That is, forces of auditory streaming that support melodic processing or voicing may inhibit fusion of the simultaneous intervals that are formed when voices are combined (Huron, 2001). Tonal fusion cannot be avoided entirely, however. As more voices are added in polyphony, there is a tendency for some of the simultaneous tone combinations to fuse, leading to underestimates of the number of independent voices (Huron, 1989; see also Parncutt, 1993). In homophony, tonal fusion is emphasized, but research has yet to establish whether this emphasis can inhibit melodic processing entirely.
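Setting musical context aside for a moment, the roughness account of sensory dissonance developed earlier in this section lends itself to a simple computational sketch. The code below sums pairwise roughness between the partials of two six-harmonic complex tones, in the spirit of the Plomp and Levelt (1965) calculations described above. The specific constants follow one published parameterization (Sethares, 2005) and should be treated as approximate; the comparison of a fifth and a tritone above a 250-Hz lower tone is purely illustrative.

```python
import math

def pair_roughness(f1: float, f2: float, a1: float = 1.0, a2: float = 1.0) -> float:
    """Approximate roughness contributed by a pair of pure-tone partials."""
    fmin, fmax = min(f1, f2), max(f1, f2)
    # The interaction bandwidth scales with frequency, mimicking the critical band.
    s = 0.24 / (0.0207 * fmin + 18.96)
    diff = fmax - fmin
    return a1 * a2 * (math.exp(-3.51 * s * diff) - math.exp(-5.75 * s * diff))

def dyad_dissonance(f0_low: float, f0_high: float, n_harmonics: int = 6) -> float:
    """Summed roughness for two harmonic complex tones with equal-amplitude partials."""
    partials = [f0_low * k for k in range(1, n_harmonics + 1)]
    partials += [f0_high * k for k in range(1, n_harmonics + 1)]
    total = 0.0
    for i in range(len(partials)):
        for j in range(i + 1, len(partials)):
            total += pair_roughness(partials[i], partials[j])
    return total

# With the lower tone fixed at 250 Hz, a perfect fifth yields much less summed
# roughness than a tritone, consistent with the consonance peaks described above.
print("perfect fifth:", round(dyad_dissonance(250.0, 250.0 * 3 / 2), 3))
print("tritone      :", round(dyad_dissonance(250.0, 250.0 * 2 ** (6 / 12)), 3))
```

Sweeping the upper tone across a continuum of interval sizes with code of this kind traces out a dissonance curve with local minima near the consonant ratios discussed above.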

B. Sequential Intervals
Intervals formed by the succession of two tones, also called melodic or sequential intervals, are the basis for melody. Melody, in turn, plays a profound role in music. Large sequential intervals, when they are followed by a change in direction, form the basis for gap-fill melodies (Meyer, 1973), and melodic leaps are perceived as points of melodic accent (Boltz & Jones, 1986; Jones, 1987). Conversely, melodies that consist of a sequence of small intervals sound coherent and cohesive (Huron, 2001). Sequences of melodic intervals comprise the fingerprint of music, and copyright infringement cases usually focus on melody and rarely on harmonic, rhythmic, or timbral attributes of music (Cronin, 1997–1998; Frieler & Riedemann, 2011; Müllensiefen & Pendzich, 2009). In the well-known court action by Keith Prowse Music (KPC) against George Harrison, alleging copyright infringement for his hit song "My Sweet Lord," crucial legal arguments hinged on

a sequence of two descending intervals. The contentious intervals had been used in an earlier popular song, "He's So Fine" by the Chiffons, along with other melodic details (Southall, 2008). Sequential intervals have substantially different functions in music than simultaneous intervals. In Eugene Narmour's (1990, 1992) implication-realization model of music, all patterns of sequential intervals can be classified into a set of basic melodic structures. Because any melodic interval can evoke an implication for possible continuations, the tone that follows an interval can be construed as a realization that fulfills or denies the initial intervallic implication. As a melody unfolds, the pattern of fulfillments and denials of intervallic expectations shapes a listener's perception of structure (for a review and assessment of the model, see Thompson, 1996). Narmour proposed several principles of melodic implication, which have been evaluated in a wide range of empirical studies (e.g., Cuddy & Lunney, 1995; Krumhansl, 1995a, 1995b; Pearce & Wiggins, 2006; Schellenberg, 1996, 1997; Schellenberg, Adachi, Purdy, & McKinnon, 2002; Thompson, Balkwill, & Vernescu, 2000; Thompson, Cuddy, & Plaus, 1997; Thompson & Stainton, 1998). Although the details of Narmour's principles have been questioned, empirical data generally support the essential hypotheses of the implication-realization model. Mechanisms underlying sequential and simultaneous intervals interact and overlap. Horizontal (melodic) structure can be used to attenuate the perceived level of potential dissonance in the ongoing harmony, and melodic intervals themselves connote differing levels of consonance and dissonance even though they do not give rise to roughness and beating. In fact, the earliest use of the terms consonance and dissonance referred to successive melodic intervals (Tenney, 1988). The size of sequential intervals is typically smaller than that of simultaneous intervals in music. Figure 4 shows a histogram of the relative proportion of

Figure 4 Histogram showing the relative occurrence of pitch intervals of different sizes in Western melodies (classical and rock: white bars; folk: dark bars), plotted as frequency of occurrence (%) against interval size in semitones. From Vos and Troost (1989).

different pitch interval sizes in Western melodies, and indicates that small intervals (1–2 semitones) predominate in classical, rock, and folk music. This feature of sequential intervals arises because melodies are a type of auditory stream and are subject to principles of auditory stream segregation (Bregman, 1990). Sequential intervals within a melody are usually composed such that the component tones are perceived within the same auditory stream (Huron, 2001). The neural mechanisms that support auditory streaming are still not fully understood, but any comprehensive model would have to account for both primitive and schema-based segregation, including the role of attention (Carlyon, 2004). Bidet-Caulet and Bertrand (2009) proposed that auditory streams are determined by the separation of neural populations activated by successive tones. If the responses to two successive tones exceed a certain threshold of activation within the same neural population, one stream is perceived; if responses to the two tones exceed this threshold of activation in nonoverlapping neural populations, two streams are heard (see also, Micheyl et al., 2007). In a musical context, the perception and appreciation of melodic intervals are likely to be influenced by both mechanisms that support auditory streaming and mechanisms underlying consonance and fusion for simultaneous intervals. Tones that blend well together as a simultaneity also tend to work well when played in sequence. For example, the most consonant simultaneous interval, an octave, often occurs melodically, as in the first two notes of the song "Over the Rainbow" or "The Christmas Song" ("Chestnuts roasting on an open fire"). The second most consonant interval, a fifth, occurs prominently in the children's song "Twinkle Twinkle" and in "Scarborough Fair"; the major third occurs melodically in "The Itsy-Bitsy Spider"; the major sixth in "My Bonnie"; and the perfect fourth in "Oh Tannenbaum." This coincidence suggests that mechanisms that support consonance and dissonance in simultaneous intervals may be engaged when the tones of those intervals are played in sequence. Neural responses to the initial tone of a melodic interval may linger beyond the offset of that tone (i.e., in working memory) and interact with neural responses to a subsequent tone. Sequential tone combinations cannot give rise to physical beating and roughness except in highly resonant environments, such as churches. However, the combined neural activity of sequential tones occurring within working memory could potentially be subject to periodicity detectors. An alternative explanation is that persistent exposure to consonant simultaneous intervals leads to expectations and preferences for those intervals melodically. One feature that distinguishes the perception of simultaneous and sequential intervals is that sequential intervals are coded in (at least) two ways: as a magnitude of pitch change and as a pitch contour. The magnitude of sequential intervals is retained with high efficiency in long-term memory for familiar melodies (Attneave & Olson, 1971; Dowling & Bartlett, 1981) but is poorly retained in memory for novel melodies (Dowling, 1978). Pitch contour, the direction of change from one note to another over time, is salient for novel melodies (Dowling & Fujitani, 1970; Edworthy, 1985). Children and infants also rely primarily on contour when listening to and remembering melodies

(Chang & Trehub, 1977; Morrongiello, Trehub, Thorpe, & Capodilupo, 1985; Pick, Palmer, Hennessy, & Unze, 1988; Trehub, Bull, & Thorpe, 1984). It is often suggested that the mechanisms underlying melody processing may be engaged for domains other than music, such as speech intonation (Ilie & Thompson, 2006, 2011; Miall & Dissanayake, 2003; Patel, 2003, 2008; Thompson et al., 2004; Thompson & Quinto, 2011). Ilie and Thompson (2006, 2011) found that manipulations of basic acoustic attributes such as intensity, pitch height, and pace (tempo) have similar emotional consequences whether imposed on musical or spoken stimuli. Thompson et al. (2004) showed that administering 1 year of piano lessons to a sample of children led to an increase in sensitivity to emotional connotations of speech prosody. Finally, there is convergence of statistical data on pitch changes that occur in speech and melodies. For example, Patel, Iversen, and Rosenberg (2006) compared the average pitch variability in French and English speech and folk songs. Spoken French had significantly lower pitch variability from one syllable to the next than spoken English, and a parallel difference was observed for French and English folk songs. The neural substrates for processing contour and interval size appear to be different (Liégeois-Chauvel, Peretz, Babaï, Laguitton, & Chauvel, 1998; Peretz & Coltheart, 2003; Schuppert, Münte, Wieringa, & Altenmüller, 2000). This modularity view is supported by findings of selective impairments in music recognition ability after brain injury or among individuals with congenital difficulties (see Chapter 13, this volume). However, such dissociations have ambiguous implications. For example, accurate processing of precise intervals may depend on the successful operation of multiple computations such that damage to any one leads to impaired interval perception. Contour perception may involve fewer or less precise computations and may therefore be less susceptible to impairment following brain injury. Moreover, if the extraction of contour has more widespread application than the extraction of interval size (e.g., in speech prosody perception), then it may be robust to degradation, leading to apparent dissociations between contour and interval size following brain injury. McDermott, Lehr, and Oxenham (2008) provided evidence that the capacity to extract contour is a general property of the auditory system. They presented participants with a sequence of five tones followed by a second sequence that was transposed up or down in pitch. The five tones varied in one of three acoustic attributes: pitch (as in a melody), timbre, and intensity. The task was to judge whether the pattern of variation (contour) in the two stimuli was the same or different. One finding was that contours of timbre and intensity were recognized just as well as contours of pitch, suggesting that relative pitch is merely one example of a general sensitivity to relational information in the acoustic environment. Moreover, participants could map pitch contours to similar contours in timbre or intensity, a capacity that can also be extended to visual contours (Prince, Schmuckler, & Thompson, 2009; Schmuckler, 2004). That is, increases in brightness and intensity were heard as similar to increases in pitch, but dissimilar to decreases in pitch (see also Neuhoff, Kramer, & Wayand, 2002). These findings suggest that contour is represented by a general code that permits comparison

between different acoustic attributes. Such a general representation would likely receive input from change-detection mechanisms tuned to specific attributes of sound. With respect to pitch, Demany, Semal, and Pressnitzer (2011) provided evidence that two types of change-detection mechanisms are engaged when the auditory system is presented with tone sequences. One mechanism involves an implicit comparison of pitch information made by automatic and direction-sensitive frequency-shift detectors, and may contribute to a representation of pitch contour. The other involves explicit comparisons of tones and is sensitive to the magnitude of a frequency change (interval size). Both mechanisms may be implicated in the formation of mental representations of melodies (see also, Demany, Pressnitzer, & Semal, 2009). In view of the prominent role of pitch intervals in music, one may assume that the auditory system has a specialized capacity to compare two different sounds on the basis of pitch. To evaluate this possibility, McDermott, Keebler, Micheyl, and Oxenham (2010) examined the precision of interval perception using a simple discrimination task. Interval acuity was evaluated for three auditory attributes: pitch, brightness (timbre), and loudness. Interval thresholds were then defined relative to units of just-noticeable-difference (JND) for that attribute (calculated as the JND for interval size discrimination divided by JND for discrimination of individual levels of the attribute). When interval acuity was calculated in this manner, however, it was actually worse for pitch than for the attributes of brightness and loudness. The primary reason for this outcome is that the JND for pitch was very low, and much lower than that for brightness and loudness. Nonetheless, the result suggests that the auditory system may not be specifically designed for discriminating melodic intervals per se, but has special properties that permit fine-grained pitch resolution. Indeed, even for musically trained participants, pitch-interval thresholds were generally greater than a semitone. If listeners are unable to discriminate intervals that differ in size by a semitone, then how are melodies perceived and remembered? Shifting the pitch of a single note of a melody is highly noticeable, even when it only alters the original pitch by one semitone. Several decades ago, Dowling (1978) suggested that unfamiliar melodies are not encoded as a sequence of intervals but as a melodic contour attached to an underlying scale. Only for familiar melodies are interval sizes retained, and the mechanisms that permit their retention in memory are the subject of current model building (Deutsch, 1999; Chapter 7, this volume). Given the complex neural architecture of the auditory system, the abstraction of specific intervals is feasible (Deutsch, 1969). However, for musically naive listeners, the capacity to perceive and discriminate melodic intervals may arise from less specific computations and abilities, such as coarse-grained sensitivity to interval size or overall pitch distance, contour perception, the capacity to infer an underlying scale, and fine-grained pitch discrimination. Disentangling these capacities is a significant challenge for model building and for designing studies of interval perception. The intriguing and complex nature of interval perception was underscored by a series of experiments conducted in our lab (Russo & Thompson, 2005a, 2005b;

Thompson, Peter, Olsen, & Stevens, 2012; Thompson & Russo, 2007; Thompson, Russo, & Livingstone, 2010). These studies illustrate that the perceived size of isolated melodic intervals is dependent on a range of contextual factors such as timbre, intensity, overall pitch height, and even visual signals arising from the musicians who are producing the intervals. Russo and Thompson (2005a) presented ascending and descending sequential intervals to listeners, who rated the size of each interval on a scale from 1 to 5. The spectral centroid (the brightness of the timbre) of each component tone of the interval was manipulated to create congruent and incongruent conditions. In the congruent condition, the spectral centroid of the two tones of the interval mirrored the pitch of those tones. For example, in the ascending interval condition, the initial tone had a lower spectral centroid and the second tone had a higher spectral centroid. In the incongruent condition, the spectral centroid of the two tones of the interval conflicted with the pitch of those tones. For example, in the ascending interval condition, the first tone had a higher spectral centroid and the second tone had a lower spectral centroid. Ratings of interval size were influenced by the timbre of the component tones, with significantly higher ratings for congruent conditions than for incongruent conditions. The results suggest that pitch and timbre are perceived nonindependently (Garner, 1974; Melara & Marks, 1990), such that interval size judgments are weighted perceptually by the timbral context. A related implication is that judgments of interval size engage a general process of evaluating the psychological distance between tones. In another study, Russo and Thompson (2005b) asked musically trained and untrained participants to provide magnitude estimates of the size of melodic intervals presented in a high or a low pitch register, using a scale from 1 to 100. Ascending and descending intervals were created by using pitches that differed from each other by between 50 cents (one half of a semitone) and 2400 cents (two octaves). Participants were told that the smallest and largest intervals should be assigned values of 1 and 100, respectively. Estimates of interval size were dependent on both the pitch height and direction of the interval. Ascending intervals were judged as larger than descending intervals when presented in a high pitch register, but descending intervals were judged as larger than ascending intervals when presented in a low pitch register. One interpretation of this interaction relates to the finding that listeners expect intervallic pitch movement towards the center of the pitch register (Huron, 2006; von Hippel & Huron, 2000). Unexpected movement (away from the center of the pitch register) may be perceived as more salient than movement toward an expected event, leading to higher estimates of interval size. We also observed significant effects of music training. For intervals up to an octave, there was greater differentiation of interval sizes by musically trained than untrained listeners. In this range, only trained listeners judged interval size in a manner consistent with a logarithmic mapping of fundamental frequency. For intervals larger than an octave, trained and untrained listeners showed less differentiation of interval sizes, and neither group judged intervals according to a logarithmic mapping of fundamental frequency.
In other words, the effects of musical training were not observed for intervals larger than an octave, but were restricted to intervals that occur frequently in music.

This divergence of interval size judgments from the logarithmic scale is reminiscent of the early psychophysical studies that led to the mel scale. Stevens, Volkmann, and Newman (1937) defined a pure tone of 1000 Hz at 40 dB above threshold as 1000 mels, and the pitch in mels of other frequencies was determined by asking musically untrained participants to adjust a comparison pure tone until it was perceived as one half the pitch height of a standard tone (method of fractionation). The mel scale and the logarithmic scale are approximately equivalent below 500 Hz, but they diverge above 500 Hz where perceptually equivalent interval sizes (in mels) span progressively smaller frequency ratios (see also Beck & Shaw, 1961; Greenwood, 1997; Stevens & Volkmann, 1940). Tonal context also affects judgments of pitch relations. Krumhansl (1979) asked listeners to judge the similarity between pairs of tones presented immediately following key-defining musical contexts. By focusing on similarity ratings instead of interval size or categorical labels, it was possible to elicit influences on interval perception that are not evident for other types of judgments. The pattern of ratings revealed that a musical context greatly affects the psychological relationship between tones. Tone pairs taken from the tonic triad of the defining key (first, third, or fifth scale degrees of a major scale) were judged as closely related. However, when the same intervals were not members of the tonic triad, the perceived similarity between the tones was lower. Similarity was also affected by the order in which tones were presented. Tones less related to the tonality (e.g., nondiatonic tones) were judged as more similar to stable tones within the tonality (e.g., members of the tonic triad) when the less stable tone was presented first than when the order was reversed, illustrating a kind of prototype effect. In short, intervals are perceived in different ways depending on their function within an underlying tonal context and do not depend merely on psychoacoustic factors. Geometric models of pitch also imply that a complete psychological description of pitch relationships requires multiple dimensions (see also Deutsch, 1969, 1992; Chapter 7, this volume; Krumhansl, 1990; Krumhansl & Kessler, 1982; Shepard, 1964, 1982a, 1982b, 2011). It has often been suggested that melodies imply movement (Boltz, 1998; Jones, Moynihan, MacKenzie, & Puente, 2002; Repp, 1993; Shepard, 2011; Shove & Repp, 1995), and melodic intervals are often described using movement-based metaphors such as "rising" and "falling." Do melodic intervals have motional qualities? According to Common Coding theory, movement areas of the brain may be activated if music is perceived in terms of underlying or associated actions (Leman, 2009; Overy & Molnar-Szakacs, 2009; Prinz, 1996; Thompson & Quinto, 2011; Zatorre, Chen, & Penhune, 2007). Recent investigations in our lab led by Paolo Ammirante provided evidence that pitch changes interact with timing mechanisms in the motor system (Ammirante & Thompson, 2010, 2012; Ammirante, Thompson, & Russo, 2011). These studies used a continuation-tapping paradigm, whereby participants tapped in synchrony with a pacing signal and then attempted to continue tapping at the same rate once the pacing signal was removed. To examine the influence of pitch changes on the motor system, each tap in the continuation phase triggered a sounded tone. The pitches of these tones were then manipulated to form melodic patterns. Changes in pitch

systematically affected the timing of the taps that followed. Where a triggered tone implied faster melodic motion (larger melodic leaps within the same amount of time) the intertap interval (ITI) that the tone initiated was shorter (faster taps); where a triggered tone implied slower melodic motion, ITI was longer. That is, the implied melodic motion arising from intervals of different sizes was reflected in the timing of actions. The role of movement in interval perception is also suggested by my research on the facial expressions of musicians (Thompson & Russo, 2007; Thompson, Russo, & Livingstone, 2010; Thompson, Russo, & Quinto, 2008). This work indicates that the perception of melodic intervals is significantly affected by the facial expressions of the musicians who are producing those intervals. Thompson et al. (2010) asked participants to watch a musician singing a melodic interval and to judge the size of that interval on a scale from 1 to 7. Only the face of the musician was visible. We first confirmed that the facial expressions alone, even with no sound available, could convey reliable information about the size of the melodic interval being sung (see also Thompson & Russo, 2007). Visual and auditory signals were then manipulated such that the visual signal taken from a large sung interval would be synchronized with the auditory signal taken from a small sung interval, and vice versa. Results confirmed that both auditory and visual channels influenced ratings of interval size. Facial measurements revealed that musicians made a number of subtle movements of the head and eyebrows, to which participants were highly sensitive. Additional manipulations confirmed that visual information arising from singers is automatically and unconsciously taken into consideration when evaluating interval size. Such findings underscore the complex and multimodal nature of music perception and suggest that analytic judgments of interval categories may provide a limited understanding of music experience (see also, Makeig, 1982).

C. Limits and Precision of Relative Pitch


Pitch relationships play a central role in music perception and performance: they are readily perceived and remembered by listeners with or without musical training, and the capacity to produce conventional intervals on a musical instrument is a basic skill that musicians acquire early in training. How well can listeners discriminate intervals and how accurately can musicians produce them? Do some intervals have a special psychological status? One of the most basic limits to interval perception is the pitch region. At the lowest end of the audible spectrum, intervals are difficult to discriminate because many partials of individual pitches fall within the same critical band, giving rise to indistinct or rumbling pitch sensations. Within the middle of the audible range, individual pitches give rise to clear pitch sensations and intervals are readily extracted. Pitches evoked by complex tones are clearest when the fundamental lies in a region centered at 300 Hz (Terhardt, Stoll, & Seewann, 1982a, 1982b). This region of pitch clarity may well influence musical practice. Huron (2001) reported that the average

notated pitch across a large corpus of Western and non-Western music is roughly D♯4, which is very close to the region that evokes the clearest pitch sensations. At the upper end of the spectrum, especially beyond about 5000 Hz, pitch relations again become indistinct (Attneave & Olson, 1971; Ohgushi & Hatoh, 1992; Semal & Demany, 1990). One explanation for this loss of relative pitch is that temporal coding of pitch underlies the perception of music, and not spectral or place coding. Temporal coding of pitch, the phase-locked firing of auditory neurons to the stimulus waveform, occurs up to about 5 kHz, which coincides with the upper limit of relative pitch (Moore, 2004; van Noorden, 1982). Place coding, which is related to the place of maximum excitation by pitches on the basilar membrane, allows pitch discrimination across a greater range of frequencies. Thus, above 5 kHz, where temporal coding is absent but place coding remains, listeners are still capable of ordering pitches on a scale from low to high but are unable to differentiate specific intervals or hear pitch sequences as musical signals (Houtsma, 1984; Semal & Demany, 1990; Oxenham [Chapter 1, this volume] provides an extended discussion of place and temporal theories of pitch perception). A number of psychophysical methods have been adopted to explore the limits and precision of musical interval perception, as reviewed extensively by Burns (1999; see also Zarate, Ritson, & Poeppel, 2012). Houtsma (1968) adopted a paired-comparison discrimination task to estimate JNDs in the size of musical intervals. In this task, participants are presented with two intervals and must indicate which is larger (two-alternative forced-choice). The pitch of the first tone was randomized to force participants to base their responses on interval size rather than absolute pitch values. The average JND for the octave was 16 cents, and JNDs for other intervals of the chromatic scale ranged from 13 to 26 cents. In the method of adjustment, individuals are presented with a pair of tones, either in sequence or simultaneously. One of the tones is fixed and the other can be adjusted. Participants are instructed to adjust the variable tone such that the pitch of the two tones matches a certain interval. For example, a participant may be asked to adjust the variable tone such that the interval between the two tones is an octave. Relative pitch possessors are quite consistent across repeated adjustments. For sequential or simultaneous octaves, the average intrasubject standard deviation of repeated adjustments is approximately 10 cents if the two tones are sinusoidal and less if they are complex tones (Burns, 1999; Sundberg & Lindquist, 1973; Terhardt, 1969; Ward, 1954). Based on his own research and a review of research, Burns (1999) concluded that when individuals adjust tones to produce a target interval there is a tendency to compress small intervals of four semitones or fewer (adjust narrower than equal tempered) and to stretch large intervals of eight semitones or greater. However, such effects depend on the precise interval involved. For example, compression is clearly observed for the ascending and descending minor second interval (Vurma & Ross, 2006) but not for the major second interval (Loosen, 1993; Ross, 1984). The inclination to compress or stretch intervals also depends on the frequency region in which the interval is played (Rosner, 1999).
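The bookkeeping behind these adjustment and discrimination measures can be sketched in a few lines of code. The snippet below is illustrative only: the frequencies, the 12-semitone target, and the tolerance criterion are arbitrary stand-ins rather than values from the studies cited above. It simply expresses a produced interval in cents and compares it with an intended equal-tempered target, the kind of calculation used to quantify compression and stretch.

```python
import math

def cents(f_low: float, f_high: float) -> float:
    """Size of the interval from f_low up to f_high, in cents (100 cents = 1 equal-tempered semitone)."""
    return 1200.0 * math.log2(f_high / f_low)

def adjustment_error(f_reference: float, f_adjusted: float, target_semitones: int) -> float:
    """Signed error (cents) of an adjusted tone relative to an equal-tempered target interval."""
    return cents(f_reference, f_adjusted) - 100.0 * target_semitones

# A hypothetical octave adjustment that comes out slightly wide of the 2:1 ratio.
reference, adjusted = 440.0, 882.5
error = adjustment_error(reference, adjusted, target_semitones=12)
print(f"produced size: {cents(reference, adjusted):.1f} cents, error: {error:+.1f} cents")

# Compare the error against an illustrative JND-like criterion (on the order of 10-20 cents).
print("within tolerance" if abs(error) < 15 else "detectably mistuned")
```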

The octave stretch phenomenon has received especially close attention (Burns, 1999; Dowling & Harwood, 1986). Intervals defined by the frequency ratio of 2:1 are perceived to be smaller than an octave, and intervals judged to be accurate instances of the octave are characterized by frequency ratios that are slightly larger than 2:1. This effect is more evident for sequential intervals than simultaneous intervals (Burns, 1999), is observed across cultures (Burns, 1974), and has been confirmed using a range of psychophysical techniques (Dobbins & Cuddy, 1982; Hartmann, 1993). Although a number of explanations have been proposed (e.g., Ohgushi, 1983; Terhardt, 1971; Young, 1952), the phenomenon has yet to be fully understood. In music performance, technical skill and expressive intentions determine whether intervals are performed accurately (Vurma & Ross, 2006). For variable-pitch instruments such as the violin, music performance involves a continuous process of adjusting the pitches of tones in the music. These adjustments, called intonation, are often aimed at accurate rendering of notated intervals but intervals may be intentionally compressed or stretched for expressive purposes. Some genres such as Romantic music permit significant use of expressive intonation, whereas other genres are associated with high intonation accuracy. For example, Hagerman and Sundberg (1980) reported that the average intonation accuracy in a sample of expert barbershop singers was less than 3 cents. The aesthetic impact of compressing or stretching intervals occurs without interfering with the essential identity of those intervals. This outcome is possible primarily because listeners expect the interval sizes typically performed by musicians and not the precise interval sizes defined by equal-temperament tuning (Vurma & Ross, 2006). Another factor is that musically trained listeners perceive intervals categorically (Burns, 1999; Burns & Ward, 1978). Two observations support this claim. First, when intervals are adjusted by small amounts to be smaller or larger, identification functions by musically trained listeners tend to show sharp category boundaries. For example, if a series of intervals is presented that are intermediate between a major second and a minor third, listeners tend to perceive a repeated presentation of the smaller interval, followed by an abrupt change in the interval category, and then a repeated presentation of the larger interval. Second, if the size of two intervals differs by a fixed amount (e.g., 30 cents), they will be discriminated better if they are perceived to be within different interval categories (e.g., minor third and major third) than if they are perceived to be within the same interval category (two instances of a major third). Siegel and Siegel (1977) used magnitude estimation to examine categorical perception of melodic intervals. Six musicians provided magnitude estimations of 13 melodic intervals that ranged in size from roughly 6 to 8 semitones in 0.2-semitone increments. All participants identified in-tune intervals with greater than 95% accuracy. However, their magnitude estimates revealed an uneven capacity to discriminate intervals. Magnitude estimates of interval size did not increase in proportion with the stimulus magnitude but showed discrete steps corresponding to interval categories. They also judged 63% of the intervals to

be in tune even though most of them (>75%) were out of tune with respect to equal-temperament tuning. Categorical perception has also been observed for simultaneous intervals. Zatorre (1983) presented seven musicians with simultaneous intervals consisting of pure tones over a 100-cent range spanning from a minor third (300 cents) to a major third (400 cents). The study adopted a two-alternative forced-choice paradigm as well as a rating-scale identification paradigm. Category boundary effects were observed in that discrimination was better for pairs straddling the boundary between two interval categories than for pairs of intervals near the endpoints of the stimulus continuum (see also Zatorre & Halpern, 1979). Such findings illustrate that regions along the continuum of interval size exist where discrimination of simultaneous intervals is enhanced, and these regions are associated with the presence of category boundaries along this continuum. At first glance, evidence for categorical perception of musical intervals seems analogous to results reported for phonemes in speech, but there are notable differences. Most significantly, speech categories appear very early in development (Eimas, Siqueland, Jusczyk, & Vigorito, 1971) and infants exhibit perceptual sensitivities for phoneme boundaries that are not even used in their parents' language (Eimas & Corbit, 1973; Streeter, 1976). In contrast, musical interval categories seem to emerge only following explicit music experience or training. Given such differences, it is premature to conclude that the very same mechanisms underlie categorical effects in music and speech. Researchers have also examined the ability of musically trained participants to identify intervals in isolation or in a musical context. Taylor (1971) presented participants with 25 chromatic ascending and descending intervals including unison. Intervals were presented in isolation and embedded in a melody. Error rates were higher when intervals were presented in a melodic context than when they were presented in isolation. Moreover, the error rate was not correlated with the subjectively judged tonal strength of the melodies. These are surprising results given that musical contexts should allow intervals to be encoded both as musical distances (e.g., perfect fourth) and as scale degrees on an underlying scale (e.g., tonic to subdominant). Moreover, music training enhances neural encoding of musical intervals (Lee, Skoe, Kraus, & Ashley, 2009), and the majority of time spent during music training involves working with and attending to full musical contexts. Finally, an advantage for isolated intervals is not observed when other measurement techniques are adopted (Rakowski, 1990). In short, findings on discrimination and identification of intervals seem to depend on the method of evaluation. A question surrounding all studies of interval discrimination and identification is whether it is reasonable to use equal-temperament tuning as the standard for classifying intervals as in tune or out of tune, when it is known that expressive intonation rarely aligns precisely with the intervals defined by equal-temperament tuning. Francès (1958/1988) compared detection rates for two types of mistuned intervals in a musical context. In one condition, mistuned intervals were contracted or expanded with respect to equal-temperament tuning in a manner consistent with expectations based on intonation measurements taken from performed music. In the

other condition, mistuned intervals were contracted or expanded counter to expectations based on such measurements. Participants were more accurate at detecting mistuned intervals in the second condition. The finding highlights the difficulty in establishing an absolute standard against which tuning errors can be defined. As demonstrated in psychoacoustic studies by Rakowski, melodic intervals are psychological entities and their identities are associated with a range of values (Rakowski, 1976, 1982, 1985a, 1985b, 1990, 1994).
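Throughout this literature, mistunings are expressed in cents, that is, hundredths of an equal-tempered semitone (1,200 cents to the octave). For readers who wish to check such figures, the short Python sketch below converts a pair of frequencies to an interval size in cents and reports its deviation from the nearest equal-tempered interval; the frequencies in the example are hypothetical and are not taken from any of the studies cited.

import math

def cents(f1, f2):
    # Size of the interval between two frequencies, in cents (1200ths of an octave)
    return 1200.0 * math.log2(f2 / f1)

def deviation_from_equal_temperament(f1, f2):
    # Signed deviation (in cents) of the interval f1 -> f2 from the nearest
    # equal-tempered interval (a whole number of semitones)
    size = cents(f1, f2)
    nearest = round(size / 100.0) * 100.0
    return size - nearest

# Example: a slightly sharp fifth sung above A4 = 440 Hz (illustrative values)
print(cents(440.0, 664.0))                             # about 712 cents
print(deviation_from_equal_temperament(440.0, 664.0))  # about +12 cents

On this measure, the 30-cent mistunings discussed above correspond to a frequency difference of just under 2%.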

III. Scales and Tuning Systems

Melodic intervals are also fundamental to scales, the sets of discrete pitches used in most music across cultures. What are the functions of scales? Humans can distinguish more than 200 pitches within a single octave in the mid-range of hearing, but the pitches used in music are typically restricted to a small number of pitch categories. Scales divide the continuum of pitch into a discrete and manageable number of elements that are used repeatedly. There is considerable diversity of scales across musical cultures, but most are constructed from five to seven elements per octave and permit the formation of one or more consonant intervals such as the octave, fifth, and fourth. Many also allow differentiation of up to a dozen different interval sizes within each octave. The diatonic scale, for example, permits the formation of melodic intervals ranging in size from 1 to 12 semitones in any octave. The presence of precise and meaningful distinctions between interval sizes is a unique property of music. Other attributes of sound, such as timbre and intensity, are not formally represented in terms of distances between exemplars.

The concept of a scale can be defined from physical, mathematical, and psychological perspectives. From a physical perspective, it refers to the set of pitches that can be produced on a musical instrument given a certain tuning system. From a mathematical perspective, one can use a group-theoretic description of pitch sets as a way of assessing the resources available to any pitch system, such as the equal-tempered 12-fold division of the octave (Balzano, 1977, 1980, 1982). From a psychological perspective, a scale refers to a mental representation of regularities in pitch that is activated when one listens to music. Such a representation would determine, for example, whether incoming tones are perceived to be grammatical. It also helps listeners to determine the different functions of tones in a melody, thereby facilitating their encoding in memory.

Trained and untrained listeners readily extract the underlying scale from music, even after just a few tones (Cohen, 1991). It is unclear whether this capacity to infer the underlying scale plays a significant role during music listening, however, because virtually all people from an early age learn to sing the scales of their musical culture. It is possible that scales are cognitively important only to the extent that listeners internalize the frequency of occurrence of pitches in an established key (Oram & Cuddy, 1995; Krumhansl, 1985, 1990). Within a statistical learning framework, it is unnecessary to assume there is a specialized process in the brain that categorizes incoming tones as members or nonmembers of a scale. Instead, the neural circuitry that responds to pitch develops in a way that mirrors

the probability of occurrence of pitches and pitch classes. Scale notes occur more frequently than nonscale notes, so are more expected and are processed more efficiently.

Using the unfamiliar Bohlen-Pierce scale, Loui, Wessel, and Hudson Kam (2010) created musical grammars from which melodies were composed. Several decades ago, Heinz Bohlen designed the Bohlen-Pierce scale to be distinct from Western scales but still to give rise to a sense of tonality. Participants were exposed to melodies for 25-30 minutes and were then evaluated for recognition, generalization, and statistical learning. Statistical learning was assessed by asking participants to rate the goodness of fit of probe tones following melodies in the new grammar. Both musically trained and untrained participants could recognize individual melodies with high accuracy, and they generalized their knowledge to new melodies composed from the same grammar. Probe-tone ratings corresponded to the frequency of occurrence of different pitches, illustrating sensitivity to statistical properties of the melodies.

In a landmark paper, Dowling (1978) emphasized the psychological significance of scales. He presented participants with a target melody followed by a comparison melody and asked them to indicate if the melodies were the same or different. Comparison melodies were of three kinds: (a) exact transpositions of the target melody; (b) transpositions that conformed to the scale and contour of the target melody but involved changes to the precise intervals involved (i.e., tonal answers); or (c) atonal comparison stimuli. Target stimuli were matched to exact transpositions or tonal answers, but they were rarely confused with atonal comparison stimuli. Based on these and related findings, Dowling proposed that novel melodies are mainly represented by scale and contour, rather than by the precise intervals involved.

Most Western and non-Western scales permit the formation of consonant intervals. By combining notes of the diatonic major scale, one can create intervals such as an octave, fifth, fourth, third, and sixth. These intervals are consonant primarily because they are represented in the spectra of complex periodic waveforms, including the human voice and many musical instruments. In turn, when two tones with complex harmonic spectra are combined at varying pitch distances, local minima in dissonance and maxima in fusion occur when the distance between tones matches the distance between partials of individual spectra.

Just intonation (tuning) is used to create scales that optimize consonance between scale tones. Given the first scale note, or tonic, just intonation optimizes consonance in intervals by tuning other scale notes such that their fundamental frequencies relate to that of the tonic by small integer ratios: octave (2:1), fifth (3:2), fourth (4:3), major third (5:4), minor third (6:5), major sixth (5:3), and minor sixth (8:5). One limitation of just intonation scales is that they are impossible to achieve fully: if the sixth scale degree is tuned according to the ratio of 5:3, then the interval between the second and sixth scale degrees will not be consistent with the desired ratio of 3:2. A second limitation of just-intonation scales is that they are inherently key specific. They work well in the key to which the scale is tuned, and in related keys, but they sound unpleasant when played in distant keys. For example, in a C

major scale created by just tuning, an F♯ major chord has a fifth interval of 722 cents (roughly 20 cents more than a justly tuned fifth). Of course, this concern mainly applies to fixed-pitch instruments such as the keyboard, where the tuning of individual notes cannot be adjusted to suit a new key.

Pythagoras attempted to construct a complete musical scale by moving successively up and down by fifths. Moving up from an initial tone by a perfect fifth interval 12 times yields a new tone with a fundamental frequency that relates to that of the initial tone by the ratio (3/2)^12. These 12 upward steps lead back to the pitch class of the initial tone in an equal-tempered system (7 octaves higher), but not in just intonation. When the pitch defined by (3/2)^12 is transposed back down by seven octaves, the ratio becomes 531441/524288, or about 23 cents sharp of the unison. This interval is called the Pythagorean comma and is illustrated in Figure 5. Equal-temperament tuning is the practice of distributing this discrepancy equally among the 12 tones of the chromatic scale. Differences between equal-temperament tuning and just intonation are subtle but can usually be detected by careful listeners. The popularity of the equal-tempered scale among highly trained Western musicians raises questions about the central role of beating in dissonance (see also McDermott et al., 2010).

Equal temperament and just tuning are designed to maximize the number of consonant intervals between sounds with harmonic spectra, including the human voice and many musical instruments. However, several kinds of musical instruments have inharmonic timbres, such as gongs, bells, drums, singing bowls, and wooden blocks. For most Western listeners, the pitch sensations arising from harmonic instruments are clearer than those arising from inharmonic instruments, but both types of instruments can be systematically tuned. The spectra of the instruments that predominate in a musical culture influence how those instruments are tuned and, hence, the scales that become associated with the music. Sethares (2005) noted a close correspondence between intervals, scales, and spectral properties of instruments. In traditions that rely primarily on instruments with inharmonic spectra, musical scales tend to be very different from Western diatonic major and minor scales, precisely because they permit the formation of the intervals that are found within the spectra of those inharmonic instruments.
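The arithmetic behind the Pythagorean comma can be verified directly. The following Python sketch restates the calculation described above; it introduces no assumptions beyond the ratios already given in the text.

import math

def ratio_to_cents(r):
    # Convert a frequency ratio to cents
    return 1200.0 * math.log2(r)

# Twelve perfect fifths (3/2 each) versus seven octaves (2/1 each)
twelve_fifths = (3 / 2) ** 12          # about 129.746
seven_octaves = 2 ** 7                 # 128
comma = twelve_fifths / seven_octaves  # 531441/524288, about 1.01364

print(ratio_to_cents(comma))           # about 23.46 cents: the Pythagorean comma

# Equal temperament spreads the comma over the 12 fifths,
# so each tempered fifth is about 2 cents narrower than 3:2
just_fifth = ratio_to_cents(3 / 2)          # about 701.96 cents
tempered_fifth = 700.0                      # exactly 700 cents by definition
print(just_fifth - tempered_fifth)          # about 1.96 cents per fifth
print(12 * (just_fifth - tempered_fifth))   # about 23.46 cents in total

Distributing the comma equally, as equal temperament does, narrows each fifth by roughly 2 cents, a discrepancy small enough that it is rarely objectionable, which is part of what motivates the question about beating raised above.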
Figure 5 The spiral of fifths, illustrating that a complete scale cannot be created by progressively tuning pitches using the frequency ratio 3:2. After 12 perfect fifths, the new pitch is slightly displaced from the original pitch class by an amount known as the Pythagorean comma. From Sethares (2005, p. 55).

The bonang is a musical instrument used in the Javanese gamelan and consists of a collection of small gongs. According to Sethares (2005), when the spectrum of a bonang is combined with a harmonic tone, it generates a dissonance curve with minima near the steps of an idealized slendro scale, one of the two essential scales in gamelan music. Another instrument used in gamelan music, the saron, consists of seven bronze bars placed on top of a resonating frame. When the spectrum of a saron is combined with a harmonic tone, it generates a dissonance curve with minima near the steps of a pelog scale, the other essential scale in gamelan music. Based on such observations, Sethares (2005) argued that musical instruments co-evolved with tuning systems and scales. Musical instruments that are played in combination with one another must be tuned in a way that supports their combination, and this approach to tuning gives rise to the scales that shape musical structure. Once a tuning system is established, a musical tradition can also support new instruments that have spectral properties consistent with that tuning system. This process of co-evolution explains why gamelan scales and their instrument timbres, which are so distinctive, are rarely combined with the scales of Western music.

In traditions that mainly employ instruments with harmonic spectra, the tuning systems that support the formation of consonant intervals are also compatible with pentatonic (five-note) and heptatonic (seven-note, diatonic) scales. According to some researchers and theorists, this correspondence explains why major and minor pentatonic and heptatonic scales are the most widely used scales in Western, Indian, Chinese, and Arabic music over the past several centuries (Gill & Purves, 2009; Sethares, 2005). Gill and Purves (2009) observed that the component intervals of the most widely used scales throughout history and across cultures are those with the greatest overall spectral similarity to a harmonic series. The intervals derived from possible scales were evaluated for their degree of similarity to a harmonic series. Similarity was expressed as the percentage of harmonic frequencies that the dyad holds in common with a harmonic series defined by the greatest common divisor of the harmonic frequencies in the dyad. For example, if the upper tone of an interval has partials at 300, 600, and 900 Hz, and the lower tone has partials at 200, 400, and 600 Hz (a perfect fifth), then the greatest common divisor is 100 Hz. A harmonic series with a fundamental frequency at 100 Hz and the highest partial at 900 Hz (matched to the highest partial in the dyad) has nine partials. Of those nine partials, six are found in the dyad. Therefore, the percentage similarity between the dyad and a harmonic series is 100 × (6 ÷ 9) ≈ 67%. Only intervals that can be produced within a one-octave range were analyzed, and all intervals that can be formed within a given scale contributed equally to the similarity value for that scale. Because pitch is a continuum and there are an infinite number of possible scales, the scale notes were restricted to 60 possible pitches within a one-octave range, separated from each other by roughly 20 cents (one fifth of a semitone). Given these 60 possible pitches, all possible five-tone (pentatonic) and seven-tone (heptatonic) scales were analyzed. This constraint resulted in 455,126 possible pentatonic scales and more than 45 million heptatonic scales.
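The similarity measure just described is easy to express in code. The Python sketch below follows the counting convention of the worked example in the preceding paragraph (partials of both tones are tallied against the harmonic series built on their greatest common divisor); it is an illustration of that example rather than a reimplementation of Gill and Purves's (2009) actual analysis, and the choice of three partials per tone is an assumption carried over from the example.

from functools import reduce
from math import gcd

def percentage_similarity(f_lower, f_upper, n_partials=3):
    # Percentage similarity of a dyad to a harmonic series, following the worked
    # example in the text. Frequencies are assumed to be integers in Hz.
    lower = [f_lower * k for k in range(1, n_partials + 1)]
    upper = [f_upper * k for k in range(1, n_partials + 1)]
    dyad = lower + upper

    fundamental = reduce(gcd, dyad)      # greatest common divisor, e.g. 100 Hz
    highest = max(dyad)                  # e.g. 900 Hz
    # Harmonic series on the common fundamental, up to the highest dyad partial
    series = set(range(fundamental, highest + 1, fundamental))

    matches = sum(1 for f in dyad if f in series)
    return 100.0 * matches / len(series)

# Perfect fifth: lower tone 200 Hz, upper tone 300 Hz, three partials each
print(percentage_similarity(200, 300))   # 100 * (6/9), about 67%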

Among this vast number of possible scales, those with the greatest overall similarity to the harmonic series were the very scales that are used most widely across cultures and throughout history. The authors proposed that there is a biologically based preference for the harmonic series, and this preference is reflected in the scales that are used in music. An explanation with fewer assumptions, however, is that the spectral properties of the instruments used in a musical tradition influence the scales that are used (Sethares, 2005). Because a high proportion of instruments produce periodic sounds, including the human voice, most scales permit intervals that have spectral properties that are similar to the harmonic series (and hence are low in dissonance). However, traditions such as Javanese gamelan music that use inharmonic instruments have very different scales. The slendro and pelog scales permit intervals that are not similar to the harmonic series but that are predictable from the spectral properties of the instruments used in that tradition.
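Sethares's dissonance curves, referred to above, are computed by summing a roughness contribution for every pair of partials in the combined spectrum of two tones. The Python sketch below gives one common parameterization of the Plomp and Levelt (1965) roughness curve; the numerical constants are the commonly cited values and should be treated as illustrative rather than as Sethares's exact figures.

import math

def pair_dissonance(f1, f2, a1=1.0, a2=1.0):
    # Roughness contribution of one pair of partials, using a parameterization
    # of the Plomp-Levelt curve popularized by Sethares (2005); constants are
    # illustrative values, not guaranteed to match his implementation exactly.
    f_min = min(f1, f2)
    s = 0.24 / (0.021 * f_min + 19.0)
    x = abs(f2 - f1)
    return min(a1, a2) * (math.exp(-3.5 * s * x) - math.exp(-5.75 * s * x))

def dissonance(partials1, partials2):
    # Total sensory dissonance of two complex tones, each given as a list of
    # (frequency, amplitude) pairs; sums over all pairs in the combined spectrum.
    allp = partials1 + partials2
    total = 0.0
    for i in range(len(allp)):
        for j in range(i + 1, len(allp)):
            (fa, aa), (fb, ab) = allp[i], allp[j]
            total += pair_dissonance(fa, fb, aa, ab)
    return total

# Sweep a second harmonic tone against a fixed one; minima of the resulting
# curve fall near the intervals formed by the tones' own partials.
base = [(220.0 * k, 1.0 / k) for k in range(1, 7)]
for ratio in (1.0, 1.2, 1.25, 1.333, 1.5, 2.0):
    moving = [(f * ratio, a) for f, a in base]
    print(round(ratio, 3), round(dissonance(base, moving), 3))

Sweeping the interval between two harmonic tones with such a function produces minima near small-integer ratios such as 3:2 and 2:1; substituting an inharmonic spectrum, such as that of a bonang, moves the minima to different interval sizes, which is the basis of the argument summarized above.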

IV. Overview

Relative changes in pitch are salient sources of information in both music and speech. Unlike speech, music focuses on a collection of discrete pitches. Simultaneous and sequential combinations of these pitches occur extensively in music and are highly meaningful. Simultaneous intervals differ in the level of consonance and dissonance they produce. Consonant intervals such as the octave and fifth have many partials in common, and those that are unique are seldom within a critical band and do not give rise to roughness. Sensory factors constrain preferences for musical intervals, but early preferences can also be modified by learning and enculturation (see also Guernsey, 1928; McLachlan, 2011).

Sequential intervals are the basis for melody. Whereas simultaneous intervals are constrained by processes related to consonance, dissonance, and fusion, sequential intervals are subject to constraints of auditory streaming. Music generates significant interactions between these types of intervals: fusion between simultaneous intervals can be avoided by emphasizing horizontal structure, allowing listeners to perceive individual voices in polyphonic music and reducing any potential dissonance between concurrent tones. Mechanisms underlying melody processing may be engaged for domains other than music, such as speech intonation. Indeed, the capacity to extract contour may be a general property of the auditory system. Whether interval perception has a special status in the auditory system remains unclear. Our perceptions of the pitch distances in intervals are susceptible to a wide range of extraneous influences, including timbre, pitch register, direction of pitch change, tonal context, and visual signals arising from performers. Intervals also vary in performance when variable-pitch instruments are used. Such changes depend on both the technical skills and the expressive intentions of performers. Expressive intonation is detectable but does not tend to alter the perceived interval category.

Scales enable precise distinctions between interval sizes. Trained and untrained listeners are highly sensitive to scales and can even sing an underlying scale after

hearing just a few notes of music. During music listening, however, understanding of scales may be less important than mechanisms of statistical learning. Because scale development depends on instrument timbres, there is no one ideal scale or tuning system. For music that emphasizes instruments with harmonic spectra, scales tend to permit the formation of intervals such as the octave, fifth, and third, intervals that are also found in the harmonic spectra of periodic sounds. For music that emphasizes instruments with inharmonic spectra, scales permit other intervals that reflect those spectra. Nonetheless, most scales throughout history and across cultures are predictable from the harmonic series, reflecting the prevalence of harmonic spectra in musical instruments, including the human voice.

Acknowledgments
I thank Richard Parncutt, Neil McLachlan, and Catherine Greentree for helpful comments, suggestions, and editorial assistance.

References
Ammirante, P., & Thompson, W. F. (2010). Melodic accent as an emergent property of tonal motion. Empirical Musicology Review, 5, 94107. Ammirante, P., & Thompson, W. F. (2012). Continuation tapping to triggered melodies: motor resonance effects of melodic motion. Experimental Brain Research, 216(1), 5160. Ammirante, P., Thompson, W. F., & Russo, F. A. (2011). Ideomotor effects of pitch in continuation tapping. Quarterly Journal of Experimental Psychology, 64, 381393. Attneave, F., & Olson, R. K. (1971). Pitch as medium: a new approach to psychophysical scaling. American Journal of Psychology, 84, 147166. Balzano, G. J. (1977). On the bases of similarity of musical intervals [Abstract]. Journal of the Acoustical Society of America, 61, S51. Balzano, G. J. (1980). The group-theoretic description of 12-fold and microtonal pitch systems. Computer Music Journal, 4(4), 6684. Balzano, G. J. (1982). The pitch set as a level of description for studying musical pitch perception. In M. Clynes (Ed.), Music, mind and brain (pp. 321351). New York, NY: Plenum. Beck, J., & Shaw, W. A. (1961). The scaling of pitch by the method of magnitude estimation. American Journal of Psychology, 74, 242251. Bidelman, G. M., & Krishnan, A. (2009). Neural correlates of consonance, dissonance, and the hierarchy of musical pitch in the human brainstem. The Journal of Neuroscience, 29, 1316513171. Bidelman, G. M., & Krishnan, A. (2011). Brainstem correlates of behavioral and compositional preferences of musical harmony. Neuroreport, 22(5), 212216. Bidet-Caulet, A., & Bertrand, O. (2009). Neurophysiological mechanisms involved in auditory perceptual organization. Frontiers in Neuroscience, 3, 182191. Boltz, M. (1998). Tempo discrimination of musical patterns: effects due to pitch and rhythmic structure. Perception & Psychophysics, 60, 13571373.

Boltz, M., & Jones, M. R. (1986). Does rule recursion make melodies easier to reproduce? If not, what does? Cognitive Psychology, 18, 389431. Bregman, A. S. (1990). Auditory scene analysis: The perceptual organization of sound. Cambridge, MA: The MIT Press. Burns, E. M. (1974). Octave adjustment by non-western musicians [abstract]. Journal of the Acoustical Society of America, 56, S25S26. Burns, E. M. (1999). Intervals, scales, and tuning. In D. Deutsch (Ed.), The psychology of music (2nd ed., pp. 215264). New York, NY: Academic Press. Burns, E. M., & Ward, W. D. (1978). Categorical perceptionphenomenon or epiphenomenon: evidence from experiments in the perception of melodic musical intervals. Journal of the Acoustical Society of America, 63, 456468. Butler, J. W., & Daston, P. G. (1968). Musical consonance as musical preference: a crosscultural study. Journal of General Psychology, 79, 129142. Carlyon, R. P. (2004). How the brain separates sounds. Trends in Cognitive Science, 10, 465471. Cazden, N. (1945). Musical consonance and dissonance: a cultural criterion. Journal of Aesthetics and Art Criticism, 4(1), 311. Chang, H. -W., & Trehub, S. E. (1977). Auditory processing of relational information by young infants. Journal of Experimental Child Psychology, 24(2), 324331. Chiandetti, C., & Vallortigara, G. (2011). Chicks like consonant music. Psychological Science, 22(10), 12701273. doi:10.1177/0956797611418244 Cohen, A. J. (1991). Tonality and perception: musical scales primed by excerpts from the Well Tempered Clavier of J. S. Bach. Psychological Research, 28, 255270. Cook, N. D. (2007). Harmony perception: harmoniousness is more than the sum of interval consonance. Music Perception, 27, 2541. Cronin, C. (19971998). Concepts of melodic similarity in music-copyright infringement suits. Computing in Musicology, 11, 187209. Crowder, R. G. (1984). Perception of the major/minor distinction: I. historical and theoretical foundations. Psychomusicology: Music, Mind and Brain, 4, 312. Cuddy, L. L., & Lunney, C. A. (1995). Expectancies generated by melodic intervals: perceptual judgements of continuity. Perception & Psychophysics, 57, 451462. Demany, L., Pressnitzer, D., & Semal, C. (2009). Tuning properties of the auditory frequency-shift detectors. Journal of the Acoustical Society of America, 126, 13421348. Demany, L., & Ramos, C. (2005). On the binding of successive sounds: perceiving shifts in nonperceived pitches. Journal of the Acoustical Society of America, 117, 833841. Demany, L., Semal, C., & Pressnitzer, D. (2011). Implicit versus explicit frequency comparisons: two mechanisms of auditory change detection. Journal of Experimental Psychology: Human Perception and Performance, 37, 597605. Deutsch, D. (1969). Music recognition. Psychological Review, 76, 300307. Deutsch, D. (1992). Paradoxes of musical pitch. Scientific American, 267, 8895. Deutsch, D. (1999). The processing of pitch combinations. In D. Deutsch (Ed.), The psychology of music (2nd ed., pp. 349411). New York, NY: Academic Press. Dobbins, P. A., & Cuddy, L. L. (1982). Octave discrimination: an experimental confirmation of the stretched subjective octave. Journal of the Acoustical Society of America, 72, 411415. Dowling, W. J. (1978). Scale and contour: two components of a theory of memory for melodies. Psychological Review, 85, 341354. Dowling, W. J., & Bartlett, J. C. (1981). The importance of interval information in long-term memory for melodies. Psychomusicology, 1, 3049.

Dowling, W. J., & Harwood, D. L. (1986). Music cognition. New York, NY: Academic Press. Dowling, W. J., & Fujitani, D. S. (1970). Contour, interval, and pitch recognition in memory for melodies. Journal of the Acoustical Society of America, 49, 524531. Edworthy, J. (1985). Interval and contour in melody processing. Music Perception, 2, 375388. Eimas, P. D., & Corbit, J. D. (1973). Selective adaptation of linguistic feature detectors. Cognitive Psychology, 4, 99109. Eimas, P. D., Siqueland, E. R., Jusczyk, P., & Vigorito, J. (1971). Speech perception in infants. Science, 171, 303306. ` s, R. (1988). La perception de la musique (W. J. Dowling, Transl.). Hillsdale, NJ: France Erlbaum. (Original work published 1958) Frieler, K., & Riedemann, F. (2011). Is independent creation likely to happen in pop music? Musica Scientiae, 15, 1728. Fujioka, T., Trainor, L. J., Ross, B., Kakigi, R., & Pantev, C. (2005). Automatic encoding of polyphonic melodies in musicians and nonmusicians. Journal of Cognitive Neuroscience, 17, 15781592. Garner, W. R. (1974). The processing of information and structure. Potomac, MD: Erlbaum. Gill, K. Z., & Purves, D. (2009). A biological rationale for musical scales. PLoS ONE, 4(12), e8144. doi:10.1371/journal.pone.0008144 Greenwood, D. D. (1961a). Auditory masking and the critical band. Journal of the Acoustical Society of America, 33, 484501. Greenwood, D. D. (1961b). Critical bandwidth and the frequency coordinates of the basilar membrane. Journal of the Acoustical Society of America, 33, 13441356. Greenwood, D. D. (1991). Critical bandwidth and consonance in relation to cochlear frequency-position coordinates. Journal of the Acoustical Society of America, 54, 64208. Greenwood, D. D. (1997). The Mel Scales disqualifying bias and a consistency of pitchdifference equisections in 1956 with equal cochlear distances and equal frequency ratios. Hearing Research, 103, 199224. Guernsey, M. (1928). The role of consonance and dissonance in music. American Journal of Psychology, 40, 173204. Hagerman, B., & Sundberg, J. (1980). Fundamental frequency adjustments in barbershop singing. Journal of Research in Singing, 4, 117. Han, S., Sundararajan, J., Bowling, D. L., Lake, J., & Purves, D. (2011). Co-variation of tonality in the music and speech of different cultures. PLoS ONE, 6, e20160. doi:10.1371/journal.pone.0020160 Hannon, E. E., & Trainor, L. J. (2007). Music acquisition: effects of enculturation and formal training on development. Trends in Cognitive Science, 11, 466472. Hartmann, W. M. (1993). On the origin of the enlarged melodic octave. Journal of the Acoustical Society of America, 93, 34003409. Helmholtz, H. (1954). On the sensations of tones (A. J. Ellis, Trans.). New York, NY: Dover. (Original work published 1877) Houtsma, A. J. M. (1968). Discrimination of frequency ratios [Abstract]. Journal of the Acoustical Society of America, 44, 383. Houtsma, A. J. M. (1984). Pitch salience of various complex sounds. Music Perception, 1, 296307. Huron, D. (1989). Voice denumerability in polyphonic music of homogenous timbres. Music Perception, 6, 361382.

Huron, D. (1991a). Tonal consonance versus tonal fusion in polyphonic sonorities. Music Perception, 9, 135154. Huron, D. (1991b). Review of auditory scene analysis: the perceptual organization of sound by Albert S. Bregman. Psychology of Music, 19, 7782. Huron, D. (2001). Tone and voice: a derivation of the rules of voice leading from perceptual principles. Music Perception, 19, 164. Huron, D. (2006). Sweet anticipation: Music and the psychology of expectation. Boston, MA: MIT Press. (ISBN-13:978-0-262-58278-0) Huron, D. (2008). Asynchronous preparation of tonally fused intervals in polyphonic music. Empirical Musicology Review, 3(1), 1121. Hutchinson, W., & Knopoff, L. (1978). The acoustic component of Western consonance. Interface, 7, 129. Idson, W. L., & Massaro, D. W. (1978). A bidimensional model of pitch in the recognition of melodies. Perception & Psychophysics, 14, 551565. Ilie, G., & Thompson, W. F. (2006). A comparison of acoustic cues in music and speech for three dimensions of affect. Music Perception, 23, 319329. Ilie, G., & Thompson, W. F. (2011). Experiential and cognitive changes following seven minutes exposure to music and speech. Music Perception, 28, 247264. Jones, M. R. (1987). Dynamic pattern structure in music: recent theory and research. Perception & Psychophysics, 41, 621634. Jones, M. R., Moynihan, H., MacKenzie, N., & Puente, J. (2002). Temporal aspects of stimulusdriven attending in dynamic arrays. Psychological Science, 13, 313319. Juslin, P. N., & Laukka, P. (2003). Communication of emotions in vocal expression and music performance: different channels, same code? Psychological Bulletin, 129, 770814. Kallman, H. (1982). Octave equivalence as measured by similarity ratings. Perception & Psychophysics, 32, 3749. Kameoka, W., & Kuriyagawa, M. (1969a). Consonance theory part I: consonance of dyads. Journal of the Acoustical Society of America, 45, 14521459. Kameoka, W., & Kuriyagawa, M. (1969b). Consonance theory part II: Consonance of complex tones and its calculation method. Journal of the Acoustical Society of America, 45, 14601469. Krumhansl, C. L. (1979). The psychological representation of musical pitch in a tonal context. Cognitive Psychology, 11, 346374. Krumhansl, C. L. (1985). Perceiving tonal structure in music. American Scientist, 73, 371378. Krumhansl, C. L. (1990). Cognitive foundations of musical pitch. New York, NY: Oxford University Press. Krumhansl, C. L. (1995a). Effects of musical context on similarity and expectancy. Systematische Musikwissenschaft [Systematic Musicology], 3, 211250. Krumhansl, C. L. (1995b). Music psychology and music theory: problems and prospects. Music Theory Spectrum, 17, 5390. Krumhansl, C. L., & Kessler, E. J. (1982). Tracing the dynamic changes in perceived tonal organization in a spatial representation of musical keys. Psychological Review, 89, 334368. Lee, K. M., Skoe, E., Kraus, N., & Ashley, R. (2009). Selective subcortical enhancement of musical intervals in musicians. The Journal of Neuroscience, 29, 58325840. Leman, M. (2009). Embodied music cognition and mediation technology. Cambridge, MA: MIT Press.

Liegeois-Chauvel, C., Peretz, I., Babei, M., Laguitton, V., & Chauvel, P. (1998). Contribution of different cortical areas in the temporal lobes to music processing. Brain, 121, 18531867. Loosen, F. (1993). Intonation of solo violin performance with reference to equally tempered, Pythagorean, and just intonations. Journal of the Acoustical Society of America, 93, 525539. Loui, P., Wessel, D. L., & Hudson Kam, C. L. (2010). Humans rapidly learn grammatical structure in a new musical scale. Music Perception, 27, 377388. Makeig, S. (1982). Affective versus analytic perception of musical intervals. In M. Clynes (Ed.), Music, mind, and brain: The neuropsychology of music (pp. 227250). New York, NY: Plenum. Mashinter, K. (2006). Calculating sensory dissonance: Some discrepancies arising from the models of Kameoka & Kuriyagawa, and Hutchinson & Knopoff. Empirical Musicology Review, 1, 6584. McDermott, J., & Hauser, M. D. (2005). The origins of music: innateness, uniqueness, and evolution. Music Perception, 23, 2959. McDermott, A. J., Keebler, M. V., Micheyl, C., & Oxenham, A. J. (2010). Musical intervals and relative pitch: frequency resolution, not interval resolution, is special. Journal of the Acoustical Society of America, 128, 19431951. McDermott, J. H., Lehr, A. J., & Oxenham, A. J. (2008). Is relative pitch specific to pitch? Psychological Science, 19, 12631271. McDermott, J. H., Lehr, A. J., & Oxenham, A. J. (2010). Individual differences reveal the basis of consonance. Current Biology, 20, 10351041. McDermott, J. H., & Oxenham, A. J. (2008). Music perception, pitch, and the auditory system. Current Opinion in Neurobiology, 18, 112. McLachlan, N. M. (2011). A neurocognitive model of recognition and pitch segregation. Journal of the Acoustical Society of America, 130, 28452854. McLachlan, N. M., & Wilson, S. W. (2010). The central role of recognition in auditory perception: a neurobiological model. Psychological Review, 117, 175196. Melara, R. D., & Marks, L. E. (1990). Interaction among auditory dimensions: timbre, pitch, and loudness. Perception & Psychophysics, 48, 169178. Meyer, L. B. (1973). Explaining music: Essays and explorations. Berkeley, CA: University of California Press. Miall, D. S., & Dissanayake, E. (2003). The poetics of babytalk. Human Nature, 14, 337364. Micheyl, C., Carlyon, R. P., Gutschalk, A., Melcher, J. R., Oxenham, A. J., & Rauschecker, J. P., et al. (2007). The role of auditory cortex in the formation of auditory streams. Hearing Research, 229, 116131. Micheyl, C., & Oxenham, A. J. (2010). Pitch, harmonicity and concurrent sound segregation: psychoacoustical and neurophysiological findings. Hearing Research, 266, 3651. Moore, B. C. J. (2004). An introduction to the psychology of hearing (5th ed.). London, England: Elsevier Academic Press. Morrongiello, B. A., Trehub, S. E., Thorpe, L. A., & Capodilupo, S. (1985). Childrens perception of melodies: the role of contour, frequency and rate of presentation. Journal of Experimental Child Psychology, 40, 279292. llensiefen, D., & Pendzich, M. (2009). Court decisions on music plagiarism and the Mu predictive value of similarity algorithms. Musicae Scientiae, Discussion Forum, 4B, 257295. Narmour, E. (1983). Beyond Schenkerism. Chicago, IL: University of Chicago Press.

Narmour, E. (1990). The analysis and cognition of basic melodic structures. Chicago, IL: University of Chicago Press. Narmour, E. (1992). The analysis and cognition of melodic complexity. Chicago, IL: University of Chicago Press. Navia, L. E. (1990). Pythagoras: An annotated biography. New York, NY: Garland. Neuhoff, J. G., Kramer, G., & Wayand, J. (2002). Pitch and loudness interact in auditory displays: can the data get lost in the map? Journal of Experimental Psychology: Applied, 8, 1725. Ohgushi, K. (1983). The origin of tonality and a possible explanation of the octave enlargement phenomenon. Journal of the Acoustical Society of America, 73, 16941700. Ohgushi, K., & Hatoh, T. (1992). The musical pitch of high frequency tones. In Y. Cazals, L. Demany, & K. Horner (Eds.), Auditory physiology and perception. Oxford, England: Pergamon Press. Oram, N., & Cuddy, L. L. (1995). Responsiveness of Western adults to pitch distributional information in melodic sequences. Psychological Research, 57, 103118. Overy, K., & Molnar-Szakacs, I. (2009). Being together in time: music experience and the mirror neuron system. Music Perception, 26, 489504. Parncutt, R. (1989). Harmony: A psychoacoustical approach. Berlin, Germany: SpringerVerlag. (ISBN 3-540-51279-9; 0-387-51279-9) Parncutt, R. (1993). Pitch properties of chords of octave-spaced tones. Contemporary Music Review, 9, 3550. Parncutt, R. (2006). Commentary on Keith Mashinters Calculating sensory dissonance: Some discrepancies arising from the models of Kameoka & Kuriyagawa, and Hutchinson & Knopoff. Empirical Musicology Review, 1, 201203. Partch, H. (1974). Genesis of a music (2nd ed.). New York, NY: Da Capo. Patel, A. D. (2003). Language, music and the brain. Nature Neuroscience, 6, 674681. Patel, A. D. (2008). Music, language, and the brain. New York, NY: Oxford University Press. Patel, A. D., Iversen, J. R., & Rosenberg, J. C. (2006). Comparing the rhythm and melody of speech and music: the case of British English and French. Journal of the Acoustical Society of America, 119, 30343047. Pearce, M. T., & Wiggins, G. A. (2006). Expectation in melody: the influence of context and learning. Music Perception, 23, 377405. Peretz, I., & Coltheart, M. (2003). Modularity of music processing. Nature Neuroscience, 6, 688691. Pick, A. D., Palmer, C. F., Hennessy, B. L., & Unze, M. G. (1988). Childrens perception of certain musical properties: scale and contour. Journal of Experimental Child Psychology, 45(1), 28. Plack, C. J. (2010). Musical consonance: the importance of harmonicity. Current Biology, 20 (11), R476R478. doi:10.1016/j.cub.2010.03.044 Plomp, R., & Levelt, W. J. M. (1965). Tonal consonance and critical bandwidth. Journal of the Acoustical Society of America, 38, 548560. Prince, J. B, Schmuckler, M. A., & Thompson, W. F. (2009). Cross-modal melodic contour similarity. Canadian Acoustics, 37, 3549. Prinz, W. (1996). Perception and action planning. European Journal of Psychology, 9, 129154. Rakowski, A. (1976). Tuning of isolated musical intervals. Journal of the Acoustical Society of America, 59, S50.

Rakowski, A. (1982). Psychoacoustic dissonance in pure-tone intervals: disparities and common findings. In C. Dahlhaus, & M. Krause (Eds.), Tiefenstruktur der Musik (pp. 5167). t Berlin. Berlin, Germany: Technische Universita Rakowski, A. (1985a). The perception of musical intervals by music students. Bulletin of the Council for Research in Music Education, 85, 175186. Rakowski, A. (1985b). Deviations from equal temperament in tuning isolated musical intervals. Archives of Acoustics, 10, 95104. Rakowski, A. (1990). Intonation variants of musical intervals in isolation and in musical contexts. Psychology of Music, 18, 6072. Rakowski, A. (1994). Musicians tendency to stretch larger-than-octave melodic intervals. Journal of the Acoustical Society of America, 96, 3301. Repp, B. (1993). Music as motion: a synopsis of Alexander Truslits Gestaltung und Bewegung in der Musik. Psychology of Music, 21, 4872. Rosner, B. S. (1999). Stretching and compression in the perception of musical intervals. Music Perception, 17, 101114. Ross, J. (1984). Measurement of melodic intervals in performed music: some results. In J. Ross (Ed.), Symposium: Computational models of hearing and vision: Summaries (pp. 5052). Tallinn, Estonia: Estonian SSR Academy of Sciences. Ross, D., Choi, J., & Purves, D. (2007). Musical intervals in speech. Proceedings of the National Academy of Sciences, 104, 98529857. Russo, F., & Thompson, W. F. (2005a). The subjective size of melodic intervals over a two-octave range. Psychonomic Bulletin and Review, 12, 10681075. Russo, F. A., & Thompson, W. F. (2005b). An interval size illusion: extra pitch influences on the perceived size of melodic intervals. Perception & Psychophysics, 67, 559568. Schellenberg, E. G. (1996). Expectancy in melody: tests of the implication-realization model. Cognition, 58, 75125. Schellenberg, E. G. (1997). Simplifying the implication-realization model of melodic expectancy. Music Perception, 14, 295318. Schellenberg, E. G., Adachi, M., Purdy, K. T., & McKinnon, M. C. (2002). Expectancy in melody: tests of children and adults. Journal of Experimental Psychology: General, 131, 511537. Schmuckler, M. A. (2004). Pitch and pitch structures. In J. Neuhoff (Ed.), Ecological psychoacoustics (pp. 271315). San Diego, CA: Elsevier Science. Schuppert, M., Munte, T. M., Wieringa, B. M., & Altenmuller, E. (2000). Receptive amusia: evidence for cross-hemispheric neural networks underlying music processing strategies. Brain, 123, 546559. Semal, C., & Demany, L. (1990). The upper limit of musical pitch. Music Perception, 8, 165175. Sethares, W. A. (2005). Tuning, timbre, spectrum, scale (2nd ed.) London, England: Springer-Verlag. (ISBN: 1-85233-797-4) Shepard, R. N. (1964). Circularity in judgments of relative pitch. Journal of the Acoustical Society of America, 36, 23452353. Shepard, R. N. (1982a). Geometric approximations to the structure of musical pitch. Psychological Review, 89, 305333. Shepard, R. N. (1982b). Structural representations of musical pitch. In D. Deutsch (Ed.), The psychology of music (1st ed., pp. 343390). New York, NY: Academic Press. Shepard, R. N. (2011). One cognitive psychologists quest for the structural grounds of music cognition. Psychomusicology: Music, Mind and Brain, 20, 130157.

Shove, P., & Repp, B. (1995). Music motion and performance. Theoretical and empirical perspectives. In J. Rink (Ed.), The practice of performance: Studies in musical interpretation (pp. 5583). Cambridge, England: Cambridge University Press. Siegel, J. A., & Siegel, W. (1977). Categorical perception of tonal intervals: musicians cant tell sharp from flat. Perception & Psychophysics, 21, 399407. Southall, B. (2008). Pop goes to court. London, England: Omnibus Press. (ISBN: 978.1.84772.113.6) Stevens, S. S., & Volkmann, J. (1940). The relation of pitch to frequency: a revised scale. American Journal of Psychology, 53, 329353. Stevens, S. S., Volkmann, J., & Newman, E. B. (1937). A scale for the measurement of the psychological magnitude pitch. Journal of the Acoustical Society of America, 8, 185190. Streeter, L. A. (1976). Language perception of 2-month-old infants shows effects of both innate mechanisms and experience. Nature, 259, 3941. Stumpf, K. (1890). Tonpsychologie (Vol. 2). Leipzig, Germany: Verlag S. Hirzel. ge zur Akustik Musikwissenschaft, 1, Stumpf, K. (1898). Konsonanz und dissonanz. Beitra 1108. Sundberg, J., & Lindquist, J. (1973). Musical octaves and pitch. Journal of the Acoustical Society of America, 54, 922927. Taylor, J. A. (1971). Perception of melodic intervals within melodic context (Unpublished doctoral dissertation). University of Washington, Seattle. Tenney, J. (1988). A history of consonance and dissonance. New York, NY: Excelsior. Terhardt, E. (1969). Oktavspreizung und Tonhohen der Schieflung bei Sinustonen. Acustica, 22, 348351. Terhardt, E. (1971). Pitch shifts of harmonics, an explanation of the octave enlargement phenomenon. Proceedings of the 7th International Congress on Acoustics, 3, 621624. Terhardt, E. (1974). Pitch, consonance, and harmony. Journal of the Acoustical Society of America, 55, 10611069. Terhardt, E. (1984). The concept of musical consonance: a link between music and psychoacoustics. Music Perception, 1, 276295. Terhardt, E., Stoll, G., & Seewann, M. (1982a). Pitch of complex signals according to virtual-pitch theory: tests, examples, and predictions. Journal of the Acoustical Society of America, 71(3), 671678. Terhardt, E., Stoll, G., & Seewann, M. (1982b). Algorithm for extraction of pitch and pitch salience from complex tonal signals. Journal of the Acoustical Society of America, 71(3), 679688. Thompson, W. F. (1996). Eugene Narmour: The Analysis and Cognition of Basic Melodic Structures (1990) and The Analysis and Cognition of Melodic Complexity (1992): A review and empirical assessment. Journal of the American Musicological Society, 49(1), 127145. Thompson, W. F. (2009). Music, thought, and feeling: Understanding the psychology of music. New York, NY: Oxford University Press. (ISBN 978-0-19-537707-1) Thompson, W. F., Balkwill, L. L., & Vernescu, R. (2000). Expectancies generated by recent exposure to melodic sequences. Memory & Cognition, 28, 547555. Thompson, W. F., Cuddy, L. L., & Plaus, C. (1997). Expectancies generated by melodic intervals: evaluation of principles of melodic implication in a melody-completion task. Perception & Psychophysics, 59, 10691076. Thompson, W. F., & Parncutt, R. (1997). Perceptual judgments of triads and dyads: assessment of a psychoacoustic model. Music Perception, 14(3), 263280. Thompson, W. F., Peter, V., Olsen, K. N., & Stevens, C. J. (2012). The effect of intensity on relative pitch. Quarterly Journal of Experimental Psychology. Advance online publication. doi:10.1080/17470218.2012.678369

Thompson, W. F., & Quinto, L. (2011). Music and emotion: Psychological considerations. In P. Goldie, & E. Schellekens (Eds.), The aesthetic mind: Philosophy and psychology (pp. 357375). Oxford, England: Oxford University Press. Thompson, W. F., & Russo, F. A. (2007). Facing the music. Psychological Science, 18, 756757. Thompson, W. F., Russo, F. A., & Livingstone, S. L. (2010). Facial expressions of singers influence perceived pitch relations. Psychonomic Bulletin and Review, 17, 317322. Thompson, W. F., Russo, F. A., & Quinto, L. (2008). Audio-visual integration of emotional cues in song. Cognition & Emotion, 22(8), 14571470. Thompson, W. F., Schellenberg, E. G., & Husain, G. (2004). Decoding speech prosody: do music lessons help? Emotion, 4, 4664. Thompson, W. F., & Stainton, M. (1998). Expectancy in Bohemian folk song melodies: evaluation of implicative principles for implicative and closural intervals. Music Perception, 15, 231252. Trainor, L. J., Tsang, C. D., & Cheung, V. H. W. (2002). Preference for sensory consonance in 2- and 4-month old infants. Music Perception, 20, 187194. Trehub, S. E., Bull, D., & Thorpe, L. A. (1984). Infants perception of melodies: the role of melodic contour. Child Development, 55(3), 821830. Van Noorden, L. (1982). Two channel pitch perception. In M. Clynes (Ed.), Music, mind, and brain: The neuropsychology of music. New York, NY: Plenum Press. Vassilakis, P. (2005). Auditory roughness as a measure of musical expression. Selected Reports in Ethnomusicology, 12, 119144. ke sy, G. (1949). On the resonance curve and the decay period at various points on Von Be the cochlear partition. Journal of the Acoustic Society of America, 21, 245254. von Hippel, P., & Huron, D. (2000). Why do skips precede reversal? The effect of tessitura on melodic structure. Music Perception, 18(1), 5985. Vos, J. (1986). Purity ratings of tempered fifths and major thirds. Music Perception, 3, 221258. Vos, P. G., & Troost, J. M. (1989). Ascending and descending melodic intervals: statistical findings and their perceptual relevance. Music Perception, 6, 383396. Vurma, A., & Ross, J. (2006). Production and perception of musical intervals. Music Perception, 23, 331344. Ward, W. D. (1954). Subjective musical pitch. Journal of the Acoustical Society of America, 26, 369380. Wright, J. K. (1986). Auditory object perception: Counterpoint in a new context (Masters thesis). Montreal, Canada: McGill University. Wright, J. K., & Bregman, A. S. (1987). Auditory stream segregation and the control of dissonance in polyphonic music. Contemporary Music Review, 2, 6392. Young, R. W. (1952). Inharmonicity of plain wire piano strings. Journal of the Acoustical Society of America, 24, 267273. Zarate, J. M., Ritson, C. R., & Poeppel, D. (2012). Pitch-interval discrimination and musical expertise: Is the semitone a perceptual boundary? Journal of the Acoustical Society of America, 132, 984993. Zatorre, R. J. (1983). Category-boundary effects and speeded sorting with a harmonic musical-interval continuum: evidence for dual processing. Journal of Experimental Psychology: Human Perception and Performance, 9, 739752. Zatorre, R. J., Chen, J. L., & Penhune, V. B. (2007). When the brain plays music: auditorymotor interactions in music perception and production. Nature Reviews Neuroscience, 8, 547558. Zatorre, R. J., & Halpern, A. R. (1979). Identification, discrimination, and selective adaptation of simultaneous musical intervals. Perception & Psychophysics, 26, 384395.

5 Absolute Pitch
Diana Deutsch
Department of Psychology, University of California, San Diego, La Jolla, California

I. Introduction

In the summer of 1763, the Mozart family embarked on the famous tour of Europe that established 7-year-old Wolfgang's reputation as a musical prodigy. Just before they left, an anonymous letter appeared in the Augsburgischer Intelligenz-Zettel describing the young composer's remarkable abilities. The letter included the following passage:
Furthermore, I saw and heard how, when he was made to listen in another room, they would give him notes, now high, now low, not only on the pianoforte but on every other imaginable instrument as well, and he came out with the letter of the name of the note in an instant. Indeed, on hearing a bell toll, or a clock or even a pocket watch strike, he was able at the same moment to name the note of the bell or timepiece.

This passage furnishes a good characterization of absolute pitch (AP), otherwise known as perfect pitch: the ability to name or produce a note of a given pitch in the absence of a reference note. AP possessors name musical notes as effortlessly and rapidly as most people name colors, and they generally do so without specific training. The ability is very rare in North America and Europe, with its prevalence in the general population estimated as less than one in 10,000 (Bachem, 1955; Profita & Bidder, 1988; Takeuchi & Hulse, 1993). Because of its rarity, and because a substantial number of world-class composers and performers are known to possess it, AP is often regarded as a perplexing ability that occurs only in exceptionally gifted individuals. However, its genesis and characteristics are unclear, and these have recently become the subject of considerable research.

In contrast to the rarity of AP, the ability to name relationships between notes is very common among musicians. Most trained musicians have no difficulty in naming the ascending pattern D-F♯ as a major third, E-B as a perfect fifth, and so on. Further, when given the name of one of these notes, they generally have no difficulty in producing the name of the other note, using relative pitch as the cue. Yet most musicians, at least in Western cultures, are unable to name a note when it is presented in isolation.

The rarity of AP presents us with an enigma. We can take color naming as an analogy: when we label a color as red, we do not do so by comparing it with another color (such as blue) and determining the relationship between the two colors; the labeling process is instead direct and immediate. Consider, also, that note naming involves choosing between only 12 possibilities, the 12 notes within the octave (Figure 1). Such a task should be trivial for musicians, who typically spend thousands of hours reading musical scores, playing the notes they read, and hearing the notes they play. In addition, most people have no difficulty naming well-known melodies, yet this task is considerably more complex than is naming a single note. It appears, therefore, that the lack of AP is analogous to color anomia (Geschwind & Fusillo, 1966), in which patients can recognize and discriminate colors, yet cannot associate them with verbal labels (Deutsch, 1987, 1992; Deutsch, Kuyper, & Fisher, 1987).

II. Implicit AP

Reasoning along these lines, it is not surprising that most people possess an implicit form of AP, even though they are unable to name the notes they are judging. This has been demonstrated in a number of ways. One concerns the tritone paradox, a musical illusion in which people judge the relative heights of tones based on their positions along the pitch class circle, even though they are unaware of doing so. In addition, AP nonpossessors can often judge whether a familiar piece of music is being played in the correct key, and their reproductions of familiar melodies can also reflect implicit AP.

A. The Tritone Paradox


The tritone paradox was first reported by Deutsch (1986). The basic pattern that produces this illusion consists of two sequentially presented tones that are related by a half-octave (or tritone). Shepard tones are employed, so that their note names (pitch classes) are clearly defined, but they are ambiguous in terms of which octave they are in. For example, one tone might clearly be an A, but could in principle be

Figure 1 The pitch class circle.

Concert A, or the A an octave above, or the A an octave below. When one such tone pair is played (say, C followed by F♯), some listeners hear an ascending pattern, whereas others hear a descending one. Yet when a different tone pair is played (say, G followed by C♯), the first group of listeners may well hear a descending pattern and the second group an ascending one. Importantly, for any given listener, the pitch classes generally arrange themselves with respect to height in a systematic way: tones in one region of the pitch class circle are heard as higher, and tones in the opposite region are heard as lower (Figure 2). This occurs even when the spectral envelopes of the tones are averaged over different positions along the frequency continuum, so controlling for spectral effects (Deutsch, 1987, 1992, 1994; Deutsch et al., 1987; Deutsch, Henthorn, & Dolson, 2004b; Giangrande, 1998; Repp & Thompson, 2010). In experiencing the tritone paradox, then, listeners must be referring to the pitch classes of tones in judging their relative heights, so invoking an implicit form of AP. The same conclusion stems from listeners' percepts of related illusions involving two-part patterns; for example, the melodic paradox (Deutsch, Moore, & Dolson, 1986) and the semitone paradox (Deutsch, 1988). These paradoxes of pitch perception are described in Chapters 6 and 7.
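Shepard tones of the kind used in these experiments can be approximated by summing octave-spaced sinusoids under a fixed spectral envelope, so that pitch class is well defined while octave height is ambiguous. The Python sketch below is a generic construction of such a tone; the raised-cosine envelope, frequency limits, and other parameters are illustrative assumptions, not the stimulus values used by Deutsch and colleagues.

import math

def shepard_tone(pitch_class, duration=0.5, sample_rate=44100,
                 f_low=27.5, f_high=14080.0):
    # Synthesize one Shepard tone for a pitch class (0 = C, ..., 11 = B).
    # Octave-spaced components share a raised-cosine envelope over log frequency,
    # which makes the octave of the resulting pitch ambiguous.
    f = 261.6256 * 2 ** (pitch_class / 12.0)   # reference near middle C
    while f / 2 >= f_low:                       # walk down to the lowest octave
        f /= 2
    freqs = []
    while f <= f_high:                          # collect all octave transpositions
        freqs.append(f)
        f *= 2

    log_lo, log_hi = math.log2(f_low), math.log2(f_high)
    def envelope(freq):
        # Raised cosine over log frequency: loudest in the middle, fading at the edges
        phase = (math.log2(freq) - log_lo) / (log_hi - log_lo)
        return 0.5 - 0.5 * math.cos(2 * math.pi * phase)

    n = int(duration * sample_rate)
    samples = [
        sum(envelope(fr) * math.sin(2 * math.pi * fr * t / sample_rate) for fr in freqs)
        for t in range(n)
    ]
    peak = max(abs(s) for s in samples) or 1.0
    return [s / peak for s in samples]          # normalized to +/- 1

# A tritone pair such as C followed by F sharp: pitch classes 0 and 6
tone_c = shepard_tone(0)
tone_f_sharp = shepard_tone(6)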

B. Pitch Identification and Production


As a further reflection of implicit AP, musicians who are not AP possessors sometimes remark that they can identify the key in which a piece is played (Sergeant, 1969; Spender, 1980). To explore this claim, Terhardt and Ward (1982) and Terhardt and Seewann (1983) recruited musically literate subjects, most of whom were AP nonpossessors, and presented them with excerpts from Bach preludes that
Figure 2 The tritone paradox as perceived by two subjects. The graphs show the percentages of judgments that a tone pair formed a descending pattern, as a function of the pitch class of the first tone of the pair. The judgments of both subjects displayed orderly relationships to the positions of the tones along the pitch class circle, showing that they were employing implicit absolute pitch in making these judgments.

were either in the original key or transposed by various amounts. The subjects were able to judge to a significant extent whether or not the excerpts were in the original key. Specifically, Terhardt and Seewann (1983) found that the large majority of subjects achieved significant identification performance overall, with almost half of them being able to distinguish the nominal key from transpositions of one semitone. In a further study, Vitouch and Gaugusch (2000) presented AP nonpossessors with Bach's first prelude in C major on several subsequent days. On any one occasion, the piece was presented either in the correct key or transposed by a semitone, and the subjects were able to determine beyond chance whether they were hearing the original version or the transposed one (see also Gussmack, Vitouch, & Gula, 2006). An even more general effect was found by Schellenberg and Trehub (2003), who presented unselected college students with familiar theme songs from television shows, and found that the students could discriminate above chance whether or not a song had been transposed by one or two semitones (see also Trehub, Schellenberg, & Nakata, 2008).

A further experiment was carried out by Smith and Schmuckler (2008) to evaluate the prevalence of implicit AP in the general population. The telephone dial tone in North America consists of two tones at 350 and 440 Hz; this has been ubiquitous for decades, so most people in North America have been exposed to the sound on thousands of occasions. AP nonpossessors listened to the dial tone and various pitch-shifted versions, and classified each example as normal, higher than normal, or lower than normal. Although the subjects' judgments reflected a more broadly tuned sensitivity than exists among AP possessors, they could nevertheless judge as higher than normal a tone that had been transposed by three semitones. Implicit AP even occurs very early in life, before speech is acquired. This was shown by Saffran and Griepentrog (2001), who found that 8- to 9-month-old infants were more likely to track patterns of absolute than relative pitches in performing a statistical learning task.

Production tasks have confirmed the presence of implicit AP in the general population. Halpern (1989) asked subjects who were unselected for musical training to hum or sing the first notes of well-known tunes on two separate days, and found that the within-subject variability of the pitch ranges of their renditions was very low. In a further study, Levitin (1994) had subjects choose a CD that contained a popular song with which they were familiar, and then reproduce the song by humming, whistling, or singing. The songs had been performed by only one musical band, so presumably had been heard in only one key. On comparing the pitches of the first notes produced by the subjects with the equivalent ones on the CD, Levitin found that when tested with two different songs, 44% of the subjects came within two semitones of the correct pitch for both songs. In a further study, Bergeson and Trehub (2002) had mothers sing the same song to their infants in two sessions that were separated by at least a week, and based on judges' estimates, their pitch ranges in the different sessions deviated on average by less than a semitone.

III. Genesis of AP

Given that AP is rare in the Western world, there have been many speculations concerning its genesis. These fall into three general categories: first, that the ability can be acquired at any time through intensive practice; second, that it is an inherited trait that becomes manifest as soon as the opportunity arises; and third, that most people have the potential to acquire AP, but in order for this potential to be realized, they need to be exposed to pitches in association with their note names during a critical period early in life. All three views have been espoused vigorously by a number of researchers.

A. The Practice Hypothesis


Various attempts have been made to acquire AP in adulthood through extensive practice, and in general, these have produced negative or unconvincing results (Cuddy, 1968; Gough, 1922; Heller & Auerbach, 1972; Meyer, 1899; Mull, 1925; Takeuchi & Hulse, 1993; Ward, 1999; Wedell, 1934). An unusually positive finding was described by Brady (1970), a musician who had begun piano training at age 7 and who tested himself in a single-case study. He practiced with training tapes for roughly 60 hours, and achieved a success rate of 65% correct (97% correct allowing for semitone errors). While impressive, Brady's unique finding underscores the extreme difficulty of acquiring AP in adulthood, in contrast with its effortless, and often unconscious, acquisition in early childhood.

B. The Genetic Hypothesis


The view that AP is an inherited trait has had spirited advocates for many decades (Athos et al., 2007; Bachem, 1940, 1955; Baharloo, Johnston, Service, Gitschier, & Freimer, 1998; Baharloo, Service, Risch, Gitschier, & Freimer, 2000; Gregersen, Kowalsky, Kohn, & Marvin, 1999, 2001; Profita & Bidder, 1988; Révész, 1953; Theusch, Basu, & Gitschier, 2009). One argument for this view is that the ability often appears at a very young age, even when the child has had little or no formal musical training. AP possessors frequently remark that they have possessed the ability for as long as they can remember (Carpenter, 1951; Corliss, 1973; Takeuchi, 1989). On a personal note, I can still recall my astonishment on discovering, at age 4, that other people (even grownups) were unable to name notes that were being played on the piano without looking to see what key was being struck. Presumably I had received some musical training at that point, but this would have been minimal.

Another argument for the genetic view is that AP tends to run in families (Bachem, 1940, 1955; Baharloo et al., 1998, 2000; Gregersen et al., 1999, 2001; Profita & Bidder, 1988; Theusch et al., 2009). For example, in a survey of 600 musicians, Baharloo et al. (1998) found that self-reported AP possessors were four times more likely than nonpossessors to report that a family member possessed AP.


The argument from familial aggregation is not strong, however. The probability of acquiring AP is closely dependent on early age of musical training (Section III,C), and parents who provide one child with early music lessons are likely to provide their other children with early lessons also. Indeed, Baharloo et al. (2000) have shown that early musical training itself is familial. Furthermore, it is expected that babies who are born into families that include AP possessors would frequently hear musical notes together with their names early in life, and so would have the opportunity to acquire such associations at a very young age, during the period in which they learn to name the values of other attributes, such as color.

A further argument in favor of a genetic (or at least innate) contribution to AP concerns its neurological underpinnings. As described in Section VI, there is good evidence that AP possessors have a uniquely structured brain circuitry (Bermudez & Zatorre, 2009b; Keenan, Thangaraj, Halpern, & Schlaug, 2001; Loui, Li, Hohmann, & Schlaug, 2011; Oechslin, Meyer, & Jäncke, 2010; Ohnishi et al., 2001; Schlaug, Jäncke, Huang, & Steinmetz, 1995; Schulze, Gaab, & Schlaug, 2009; Wilson, Lusher, Wan, Dudgeon, & Reutens, 2009; Zatorre, Perry, Beckett, Westbury, & Evans, 1998), though the role of neuroplasticity in the development of this circuitry remains to be resolved.

Other arguments in favor of a genetic contribution to AP have centered on its prevalence in various ethnic groups. Gregersen et al. (1999, 2001), in a survey of students in music programs of higher education in the United States, found that a high percentage of East Asian students reported possessing AP. However, Henthorn and Deutsch (2007), in a reanalysis of the Gregersen et al. (2001) data, found that, considering only those respondents with early childhood in North America, the prevalence of AP did not differ between the East Asian and Caucasian respondents. Yet this prevalence was significantly higher among respondents who had spent their early childhood in East Asia rather than North America. An environmental factor or factors must therefore have been a strong determinant of the findings by Gregersen et al. As is argued later (Section IV,D), there is strong evidence that the type of language spoken by the listener strongly influences the predisposition to acquire AP.

Further evidence with respect to the genetic hypothesis concerns the distributions of AP scores that have been found in various studies. Athos et al. (2007) administered a Web-based test for AP, and obtained responses from more than 2000 self-selected participants. The scores were not continuously distributed and appeared to be bimodal, so the authors concluded that AP possessors constitute a genetically distinct population. However, 44% of the participants in this study qualified as AP possessors, a percentage far exceeding that in the general population, so that self-selection and other problems involved in unconstrained Web-based data collection render these findings problematic to interpret. Avoiding the problem of Web-based testing, Bermudez and Zatorre (2009a) advertised for musically trained subjects both with and without AP and tested them in the laboratory. When formally tested for AP, some subjects performed at a very high level of accuracy, while others performed at chance. However, the performance of a significant number of subjects fell between these two extremes, again providing evidence that AP is not an all-or-none trait.
Yet because the subjects were self-selected, the distribution of scores found in this study is also equivocal in its interpretation.

To avoid the problem of self-selection, Deutsch, Dooley, Henthorn, and Head (2009) carried out a direct-test study to evaluate the prevalence of AP among first- and second-year students at the University of Southern California Thornton School of Music. The students were tested in class and were not self-selected. Figure 3 shows the distribution of the scores among the 176 subjects who were Caucasian nontone language speakers, together with the hypothetical distribution of scores based on chance performance. As can be seen, the scores of most subjects were consistent with chance, with the distribution being slightly elevated at the high end; however, the scores of a significant proportion of subjects were above chance yet below the generally accepted criteria for AP. Other studies have confirmed that a significant proportion of the population are borderline AP possessors (Athos et al., 2007; Baharloo et al., 1998; Deutsch, Le, Shen, & Li, 2011; Dooley & Deutsch, 2010; Itoh, Suwazono, Arao, Miyazaki, & Nakada, 2005; Loui et al., 2011; Miyazaki, 1990; Oechslin et al., 2010; Rakowski & Morawska-Bungeler, 1987; Wilson et al., 2009).

Returning to the genetic issue, since most complex human traits exhibit a bell-shaped, continuous distribution, with exceptional individuals occupying the tail end of the curve (Drayna, 2007), the distributions of scores found on AP tests are indeed unusual, even though not strictly bimodal. This could reflect a genetic contribution to the predisposition to acquire AP. However, other factors, to be described later, would also be expected to skew such distributions. Ultimately, the demonstration of a genetic contribution to AP awaits the discovery of a gene or genes that contribute to this trait.

Figure 3 Distribution of absolute pitch in a population of nontone language speakers. The solid line shows the distribution of scores on a test of absolute pitch among nontone language speaking students in a large-scale study at an American music conservatory. The dashed line shows the hypothetical distribution of scores expected from chance performance. Adapted from Deutsch, Dooley, et al. (2009).


As a step in this direction, Theusch et al. (2009) have provided preliminary evidence for a genome-wide linkage on chromosome 8 in families with European ancestry that include AP possessors.

C. The Critical Period Hypothesis


A large number of studies have pointed to an association between AP possession and early age of onset of musical training (Bachem, 1940; Baharloo et al., 1998, 2000; Deutsch, Henthorn, Marvin, & Xu, 2006; Deutsch, Dooley, et al., 2009; Deutsch et al., 2011; Dooley & Deutsch, 2010, 2011; Gregersen et al., 1999; Lee & Lee, 2010; Levitin & Rogers, 2005; Miyazaki, 1988; Miyazaki & Ogawa, 2006; Profita & Bidder, 1988; Sergeant, 1969; Takeuchi, 1989; Takeuchi & Hulse, 1993; van Krevelen, 1951; Vitouch, 2003; Ward, 1999). Although many of these studies have involved small numbers of subjects, large-scale studies on this issue have also been carried out. Some of these have been surveys, in which respondents stated by self-report whether or not they possessed AP. For example, Baharloo et al. (1998), in a survey of 600 musicians, found that 40% of those who had begun musical training by age 4 self-reported having AP; this contrasted with 27% of those who had begun training at ages 4–6, 8% of those who had begun training at ages 6–9, and 4% of those who had begun training at ages 9–12. (As a caution, we should note that while the correlation with age of onset of musical training found here is impressive, absolute percentages of AP possession derived from self-report of self-selected respondents are likely to be exaggerated.) In addition, Gregersen et al. (1999), in a survey of more than 2000 music students, observed that self-reported AP possessors had begun musical training at an average age of 5.4 years.

The dependence on age of onset of musical training indicated in these surveys has been confirmed in large-scale direct-test studies. Deutsch et al. (2006) administered a test of AP to 88 students at the Central Conservatory of Music in Beijing, and to 115 students at Eastman School of Music, using a score of at least 85% correct as the criterion for AP possession. The students were tested in class, with no self-selection from within the target population. As discussed later, there was a large effect of language, with the Beijing group being speakers of Mandarin and the Eastman group being speakers of nontone languages such as English. However, there was, in addition, a systematic effect of age of onset of musical training. For the nontone language speakers, among those who had begun training at ages 4–5, 14% met the criterion, whereas 6% of those who had begun training at ages 6–7 did so, and none of those who had begun training at age 8 or later did so. For the tone language speakers, among those who had begun musical training at ages 4–5, 60% met the criterion, compared with 55% of those who had begun training at ages 6–7 and 42% of those who had begun training at ages 8–9. Further large-scale direct-test studies have confirmed the correlation between age of onset of training and the possession of AP (Deutsch, Dooley, et al., 2009; Deutsch et al., 2011; Lee & Lee, 2010), and these are discussed in Section IV,D.

Other studies pointing to the importance of early exposure to musical notes and their names have involved testing children.


Russo, Windell, and Cuddy (2003) trained children and adults to identify a single note from among a set of seven possible notes, and found that by the third week of training, the identification accuracy of children aged 5–6 surpassed the accuracy of a group of adults. In another study, Miyazaki and Ogawa (2006) tested children at a Yamaha School of Music in Japan, and found that their pitch-naming scores increased markedly from ages 4 to 7.

D. Influence of Type of Musical Training


It is often surmised that fixed-do methods of musical training are more conducive to the development of AP than are moveable-do methods. In fixed-do systems, solfège symbols (do, re, mi, etc.) define actual pitches, being equivalent to C, C♯, D, etc. In moveable-do systems, on the other hand, solfège symbols are instead used to define the roles of pitches relative to a tonic, while letter names (C, C♯, D, etc.) are used to define the actual pitches. One argument that has been advanced in favor of fixed-do methods is that AP is more prevalent in certain countries where fixed-do training is quite common, such as Japan, whereas AP is rare in certain other countries, such as England, where moveable-do training is more common instead. However, in yet other countries where fixed-do training is also common, such as France, AP is again rare, so the argument in favor of fixed-do training based on the prevalence of AP in a few selected countries is a problematic one.

Gregersen et al. (2001) noted that a high proportion of East Asians self-reported having AP, but acknowledged that fixed-do training alone could not account for their results. They observed, however, that AP possessors were more likely to have had fixed-do rather than moveable-do training. Yet unfortunately the authors did not take age of onset of musical training into account in their analysis, so their findings could instead have reflected an earlier age of onset of music lessons among those with fixed-do training.

Peng, Deutsch, Henthorn, Su, and Wang (in press) conducted a large-scale direct-test study on 283 first- and second-year students in music departments at three universities in South China: South China Normal University, Guangdong University of Foreign Studies, and South China University of Technology. Again, the students were tested in class, and the subjects were not self-selected. They were administered the same AP test as in Deutsch et al. (2006), and were asked to write down the name of each note when they heard it. Depending on their preference, they could respond either by letter name (C, C♯, D, and so on), indicating moveable-do training, or by solfège name (do, do-sharp, re, and so on), indicating fixed-do training. The expected effect of age of onset was obtained, and interestingly a large effect in favor of moveable-do training was also obtained. For those subjects with an age of onset of 9 years or less, the percentage correct on the AP test among the moveable-do subjects was almost double that among the fixed-do subjects. As a further interesting point, a far larger number of subjects responded using letter names than fixed-do solfège names, indicating that moveable-do training methods are highly prevalent in China, where the prevalence of AP is also high.
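To make the distinction concrete, the sketch below (an informal illustration of our own; the syllable-to-pitch mappings are the standard ones, but the function and variable names are merely illustrative) contrasts the two systems: under fixed-do a syllable always names the same pitch class, whereas under moveable-do it names a scale degree relative to whatever tonic is in force.

# Illustrative sketch (Python): fixed-do versus moveable-do labeling.
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

# Fixed-do: each solfege syllable names one absolute pitch class.
FIXED_DO = {"do": "C", "re": "D", "mi": "E", "fa": "F", "sol": "G", "la": "A", "ti": "B"}

# Moveable-do: each syllable names a scale degree, here given as semitones above the tonic.
MOVEABLE_DO_DEGREES = {"do": 0, "re": 2, "mi": 4, "fa": 5, "sol": 7, "la": 9, "ti": 11}

def moveable_do_pitch(syllable, tonic):
    """Pitch class named by a syllable in the major key built on the given tonic."""
    offset = MOVEABLE_DO_DEGREES[syllable]
    return PITCH_CLASSES[(PITCH_CLASSES.index(tonic) + offset) % 12]

print(FIXED_DO["mi"])                 # E, regardless of key
print(moveable_do_pitch("mi", "C"))   # E  (third degree of C major)
print(moveable_do_pitch("mi", "F#"))  # A# (third degree of F# major)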


A more convincing point with respect to type of musical training is that children who are first taught to play on transposing instruments are at a clear disadvantage for the acquisition of AP. For example, a notated C on a B♭ clarinet is played as the note B♭ rather than C, and a notated C on an F horn is played as the note F. Such discrepancies between the viewed and sounded notes would be expected to discourage the acquisition of AP. In addition, in the study by Peng et al. (in press) just described, those subjects who had been trained on Western-style musical instruments substantially outperformed those who had been trained with folk or vocal music.
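The written-to-sounding discrepancy for a transposing instrument amounts to a fixed offset in semitones; the sketch below (our own illustration, using the standard offsets for these two instruments) makes the mapping explicit.

# Illustrative sketch (Python): written versus sounding pitch for transposing instruments.
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

# Sounding pitch relative to written pitch, in semitones.
TRANSPOSITION = {
    "Bb clarinet": -2,  # sounds a major second below what is written
    "F horn": -7,       # sounds a perfect fifth below what is written
    "piano": 0,         # non-transposing
}

def sounding_pitch_class(written, instrument):
    """Pitch class actually heard when the written note is played on the instrument."""
    i = PITCH_CLASSES.index(written)
    return PITCH_CLASSES[(i + TRANSPOSITION[instrument]) % 12]

print(sounding_pitch_class("C", "Bb clarinet"))  # A# (enharmonically Bb)
print(sounding_pitch_class("C", "F horn"))       # F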

IV. AP and Speech Processing

A linkage between AP and speech processing is indicated by several lines of evidence. First, in experiencing the tritone paradox, percepts vary depending on the language or dialect to which the listener has been exposed, particularly in childhood. Second, the critical periods for acquisition of AP and speech have remarkably similar timetables. Third, the neuroanatomical evidence points to a commonality of brain structures that underlie AP and speech processing. Fourth, the prevalence of AP is very high among speakers of tone languages, in which pitch is critically involved in determining lexical meaning.

A. Evidence from the Tritone Paradox


One body of evidence pointing to a linkage between AP and speech concerns the tritone paradox (Deutsch, 1986, 1991, 1992; Deutsch, Henthorn, & Dolson, 2004b; Deutsch et al., 1987; Deutsch, North, & Ray, 1990). As described earlier, judgments of this pattern show systematic relationships to the positions of the tones along the pitch class circle, even though the listeners are unable to name the tones they are judging. Further research has shown that the form of this relationship varies with the language or dialect to which the listener has been exposed (Chalikia & Leinfelt, 2000; Chalikia, Norberg, & Paterakis, 2000; Chalikia & Vaid, 1999; Dawe, Platt, & Welsh, 1998; Deutsch, 1991, 1994; Deutsch et al., 2004b; Giangrande, 1998; Ragozzine & Deutsch, 1994), and also correlates with the pitch range of the listener's speaking voice (Deutsch et al., 1990, 2004b), which in turn varies depending on the speaker's language or dialect (Dolson, 1994; Deutsch et al., 2004b; Deutsch, Le, Shen, & Henthorn, 2009). The tritone paradox, then, provides an example of implicit AP that is closely related to phonological processing of speech.

B. Critical Periods for AP and Speech


The verbal labeling of pitches necessarily involves language, and this leads to the conjecture that the critical period for acquiring AP might be linked to that for acquiring speech.


Lenneberg (1967) pointed out that adults and young children acquire a second language in qualitatively different ways. Following puberty, such acquisition is self-conscious and labored, and a second language that is acquired in adulthood is generally spoken with a foreign accent (see also Scovel, 1969; Patkowski, 1990). Of particular interest, the aspect of a second language that is most difficult to acquire is phonological. Joseph Conrad provides a famous example here. He learned English at age 18, and after a few years of practice produced some of the best works of English literature; nevertheless, his foreign accent was strong enough to prevent him from lecturing publicly in English. Since Lenneberg's book was published, there have been numerous studies of the critical period for speech acquisition (Doupe & Kuhl, 1999; Johnson & Newport, 1989; Newport, 1990; Newport, Bavelier, & Neville, 2001; Sakai, 2005). A few children who had been socially isolated early in life and later placed in a normal environment have been found not to acquire normal speech (Curtiss, 1977; Lane, 1976). Studies of recovery of speech following brain injury provide even more convincing evidence: The prognosis for recovery has been found to be most positive if the injury occurred before age 6, less positive between ages 6 and 8, and extremely poor following puberty (Bates, 1992; Dennis & Whitaker, 1976; Duchowny et al., 1996; Vargha-Khadem et al., 1997; Woods, 1983).

The timetable for acquiring AP is remarkably similar to that for acquiring speech. As noted earlier, AP is extremely difficult to develop in adulthood; yet when young children acquire this ability they do so effortlessly, and often without specific training. This correspondence between timetables suggests that the two capacities may be subserved by a common brain mechanism. Notably, although there are critical periods for other aspects of development, such as for ocular dominance columns in the visual cortex of cats (Hubel & Wiesel, 1970), imprinting in ducks (Hess, 1973), and auditory localization in barn owls (Knudsen, 1988), no other critical periods have been shown to have a similar correspondence with speech and language (see also Trout, 2003).

We can note that while speech is normally acquired in the first 2 years of life, formal music lessons can be initiated only when the child is more mature. Extrapolating back, then, from the age at which formal musical training can reasonably be initiated, we can conjecture that if infants are given the opportunity to associate pitches with meaningful words during the critical period for speech acquisition, they might readily develop the neural circuitry underlying AP at that time (Deutsch, 2002).

C. Neuroanatomical Evidence
Another argument for an association between AP and language concerns their neuroanatomical correlates. One region of particular importance here is the left planum temporale (PT), an area in the temporal lobe that corresponds to the core of Wernicke's area and that is critically involved in speech processing. The PT has been found to be leftward asymmetric in most human brains (Geschwind & Levitsky, 1968). Schlaug et al. (1995) first reported that this asymmetry is greater among AP possessors than among nonpossessors, and this finding has been followed up in several studies.


In an experiment that specifically supports an association between AP, the left PT, and speech, Oechslin et al. (2010) found that AP possessors showed significantly greater activation in the left PT and surrounding areas when they were engaged in segmental speech processing. Furthermore, Loui et al. (2011) observed that AP possession was associated with heightened connectivity of white matter between regions subserving auditory perception and categorization in the left superior temporal lobe, regions that are considered to be responsible for the categorization of speech sounds (Hickok & Poeppel, 2007). The neuroanatomical substrates of AP are explored in further detail in Section VI.

D. AP and Tone Language


The argument for a linkage between AP and language is strengthened by consideration of tone languages, such as Mandarin, Cantonese, Vietnamese, and Thai. In tone languages, words assume arbitrarily different meanings depending on the tones in which they are enunciated. Lexical tone is defined both by pitch height (register) and by contour. For example, the word "ma" in Mandarin means "mother" when it is spoken in the first tone, "hemp" in the second tone, "horse" in the third tone, and a reproach in the fourth tone. Therefore when a speaker of Mandarin hears the word "ma" spoken in the first tone, and attributes the meaning "mother," he or she is associating a particular pitch, or a combination of pitches, with a verbal label. Analogously, when an AP possessor hears the note F♯ and attributes the label F♯, he or she is also associating a particular pitch with a verbal label.

The brain substrates underlying the processing of lexical tone appear to overlap with those for processing phonemes in speech. Although the communication of prosody and emotion preferentially engages the right hemisphere in both tone and nontone language speakers (Edmondson, Chan, Siebert, & Ross, 1987; Gorelick & Ross, 1987; Hughes, Chan, & Su, 1983; Ross, 1981; Tucker, Watson, & Heilman, 1977), the processing of lexical tone is primarily a left hemisphere function. For example, impairments in lexical tone identification have been observed in aphasic patients with left-sided brain damage (Gandour & Dardarananda, 1983; Gandour et al., 1992; Moen & Sundet, 1996; Naeser & Chan, 1980; Packard, 1986). Further, normal tone language speakers exhibit a right ear advantage in dichotic listening to lexical tones (Van Lancker & Fromkin, 1973) and show left hemisphere activation in response to such tones (Gandour, Wong, & Hutchins, 1998). These lines of evidence imply that when tone language speakers perceive and produce pitches and pitch contours that signify meaningful words, circuitry in the left hemisphere is involved. From the evidence on critical periods for speech acquisition, we can assume that such circuitry develops very early in life, during the period in which infants acquire other features of speech (Doupe & Kuhl, 1999; Kuhl, Williams, Lacerda, Stevens, & Lindblom, 1992; Werker & Lalonde, 1988). So we can conjecture that if pitches are associated with meaningful words in infancy, the left hemisphere supports the association between pitches and verbal labels that subserves AP. We can further conjecture that if individuals are not provided with the opportunity to form such associations in infancy or early childhood, they should find AP very difficult to acquire later in life.


This line of reasoning could account for the presence of implicit AP combined with the absence of explicit AP in speakers of nontone languages (see also Rakowski & Miyazaki, 2007).

Given this line of reasoning, it was further surmised that tone language speakers employ precise and stable AP templates in enunciating words. As a test of this conjecture, Deutsch, Henthorn, and Dolson (1999, 2004a) gave native speakers of Vietnamese a list of words to read out on two separate days, with the words chosen so that they spanned the range of tones in Vietnamese speech. Then for each spoken word, we took pitch estimates at 5-ms intervals, and from these estimates, we derived an average pitch for each word. Then, for each subject, we calculated the difference between the average pitch for each word as it was read out on the two separate days, and we averaged these differences across words in the list. On comparing these averages across days, we found that the majority of subjects displayed averaged pitch differences of less than 0.5 semitone. In a further experiment, we presented Mandarin speakers with a list of words containing all four Mandarin tones to read out on two separate days. We found that one-third of the subjects showed averaged pitch differences across days of less than 0.25 semitone and that the Mandarin speakers were as consistent across days as on immediate repetition. However, a control group of English speakers were significantly less consistent in enunciating a list of English words across two separate days. From this, we concluded that the tone and nontone language speakers were processing the absolute pitch levels of speech in qualitatively different ways, and specifically that AP is involved in processing lexical tone.

Burnham and Brooker (2002) came to a related conclusion from a study in which nontone language speakers discriminated pairs of Thai tones that were presented as speech, filtered speech, and violin sounds. In all conditions, AP possessors outperformed nonpossessors in lexical tone discrimination. The authors concluded that absolute pitch level was an important cue to the identification of Thai tones, and they surmised that the superior performance of the AP possessors was due to their having acquired AP during the speech-related critical period.

Continuing along these lines, we can conjecture that speakers of tone languages acquire AP for musical tones as though these were the tones of a second tone language. Based on studies of acquisition of a second language (Johnson & Newport, 1989; Newport, 1990; Newport et al., 2001; Patkowski, 1990; Scovel, 1969), we would expect that tone language speakers should acquire AP for music most proficiently in early childhood, and that such proficiency should decline as age of onset of musical training increases, leveling off at around puberty. However, we would also expect the overall prevalence of AP to be higher among tone language speakers. In relation to this, we note that tone language speakers acquire the tones of a new tone language more easily than do speakers of nontone languages (see Wayland & Guion, 2004).

To examine the hypothesis that AP is more prevalent among speakers of tone languages, Deutsch et al. (2006) undertook a large-scale direct-test study of two groups of music conservatory students. The first group consisted of 115 first-year students taking a required course at Eastman School of Music; these were all nontone language speakers.


The second group consisted of 88 first-year students taking a required course at the Central Conservatory of Music in Beijing, China; these were all speakers of Mandarin. The students were tested in class, and there was no self-selection from among the subject population. Both the tone and nontone language speakers showed orderly effects of age of onset of training; however, the tone language speakers produced substantially higher scores than did the nontone language speakers, for all levels of age of onset of training.

In a further large-scale direct-test study involving no self-selection of subjects, Deutsch et al. (2011) administered the same test of AP to 160 first- and second-year students at the Shanghai Conservatory of Music. Figure 4 plots the average percentage correct for each age-of-onset subgroup, and it can be seen that the level of performance here was very high. Those who had begun musical training at or before age 5 showed an average of 83% correct not allowing for semitone errors, and 90% correct allowing for semitone errors. Those who had begun training at ages 6–9 showed an average of 67% correct not allowing for semitone errors, and 77% correct allowing for semitone errors. Those who had begun training at age 10 or over showed an average of 23% correct not allowing for semitone errors, and 34% correct allowing for semitone errors.

Lee and Lee (2010) confirmed the high prevalence of AP among speakers of Mandarin in a direct test of 72 music students at National Taiwan Normal University, using a test similar in construction to that used by Deutsch et al. (2006), but employing three different timbres: piano, viola, and pure tone. Although they found the expected effect of age of onset of musical training, 72% of the subjects achieved an overall accuracy of 85% correct on the piano tones.

Figure 4 Average percentage correct on a test of absolute pitch among students in a large-scale study at the Shanghai Conservatory of Music, as a function of age of onset of musical training. All subjects spoke the tone language Mandarin. The solid line shows performance not allowing for semitone errors, and the dotted line shows performance allowing for semitone errors. Data from Deutsch, Le, et al. (2011).


The findings of Deutsch et al. (2006, 2011) and of Lee and Lee (2010) are in accordance with the conjecture that the acquisition of AP is subject to a speech-related critical period, and that for tone language speakers, this process involves the same neural circuitry as is involved in acquiring the tones of a second tone language. However, the alternative hypothesis may also be considered that the prevalence differences between these groups were genetic in origin. To decide between these two explanations, Deutsch, Dooley, et al. (2009) carried out a direct-test study on 203 first- and second-year students at the University of Southern California Thornton School of Music, using the same AP test as had been used earlier, and again with no self-selection from among the target population. The subjects were divided into four groups. Those in the "nontone" group were Caucasian and spoke only nontone languages. The remaining subjects were all of East Asian ethnic heritage, with both parents speaking an East Asian tone language. Those in the "tone very fluent" group reported that they spoke a tone language very fluently. Those in the "tone fairly fluent" group reported that they spoke a tone language fairly fluently. Those in the "tone nonfluent" group reported "I can understand the language, but don't speak it fluently."

Figure 5 shows the average percentage correct responses on the test of AP for each linguistic group. As before, there was a clear effect of age of onset of musical training. However, there was also an overwhelmingly strong effect of tone language fluency, holding ethnicity constant: Those subjects who spoke a tone language very fluently showed remarkably high performance, far higher than that of the Caucasian nontone language speakers, and also far higher than that of the East Asian subjects who did not speak a tone language fluently. The effect of language was even manifest in a fine-grained fashion: The performance of the "tone very fluent" group was significantly higher than that of each of the other groups taken separately; the performance of the "tone fairly fluent" group was significantly higher than that of the "nontone" group, and also higher than that of the "tone nonfluent" group. Further, the performance of the (genetically East Asian) "tone nonfluent" group did not differ significantly from that of the (genetically Caucasian) "nontone" group. In a regression analysis taking only subjects of East Asian ethnic heritage, fluency in speaking a tone language was found to be a highly significant predictor of performance.

The enhanced performance levels of the tone language speakers found in the studies of Deutsch et al. (2006, 2011), Deutsch, Dooley, et al. (2009), and Lee and Lee (2010) are consistent with the survey findings of Gregersen et al. (1999, 2001) from students in music programs of higher education in the United States referred to earlier. Gregersen et al. (2001) also found that the prevalence of AP among students who were Japanese or Korean was higher than among the Caucasian students, although not as high as among the Chinese students. As described in Section III,B, the high prevalence of AP among East Asian respondents to their survey was interpreted by Gregersen et al. to indicate a genetic origin for AP. However, in a reanalysis of their data, Henthorn and Deutsch (2007) showed that the prevalence of AP among students of East Asian descent with early childhood in North America did not differ from that of Caucasians, so that their findings cannot be attributed to ethnic differences.


Figure 5 Average percentage correct on a test of absolute pitch among students in a large-scale study at an American music conservatory. Data are plotted as a function of age of onset of musical training and fluency in speaking a tone language. Those in the "tone very fluent," "tone fairly fluent," and "tone nonfluent" groups were all of East Asian ethnic heritage and spoke a tone language with differing degrees of fluency. Those in the "nontone" group were Caucasian and spoke only nontone languages. The line labeled "chance" represents chance performance on the task. Adapted from Deutsch, Dooley, et al. (2009).

Another point of interest in the study by Gregersen et al. is that the prevalence of AP was higher among the Chinese group than among the Japanese or Korean groups, and this prevalence in the latter groups was in turn higher than among the nontone language group. Japanese is a pitch accent language, so that the meanings of some words differ depending on the pitches of the syllables of which they are composed. For example, in Tokyo Japanese the word "hashi" means "chopsticks" when it is pronounced high-low, "bridge" when it is pronounced low-high, and "edge" when the two syllables are the same in pitch. In Japanese, then, pitch also plays an important role in the attribution of lexical meaning; however, this role is not as critical as it is in tone languages. In Korea, some dialects are considered pitch accent or even tonal (Jun, Kim, Lee, & Jun, 2006). For example, in the Kyungsang dialect, the word "son" means "grandchild" or "loss" when spoken in a low tone, "hand" in a mid tone, and "guest" in a high tone. On the other hand, in Seoul Korean pitch is not used to convey lexical meaning. On these grounds, one would expect the overall prevalence of AP to be higher for speakers of Japanese and Korean than for speakers of nontone languages, but not as high as for speakers of tone languages. The survey findings of Gregersen et al. (1999, 2001) are as expected from this line of reasoning.


E. Processing of Speech Sounds by AP Possessors


Evidence for enhanced processing of speech sounds has been found in AP possessors. In one experiment, Masataka (2011) required Japanese subjects to identify isolated syllables as rapidly as possible, and the mean response latency was found to be shorter for the AP possessors than for the nonpossessors. Because Japanese is a pitch accent language, this study left open the question of whether analogous findings would be obtained from speakers of nontone languages. However, Oechslin et al. (2010), in a study of German speakers, also found that AP possessors outperformed nonpossessors in tasks involving segmental speech processing.

V. AP and Pitch Processing

It is often assumed that AP possessors have "good ears"; that is, that this ability is associated with enhanced low-level auditory abilities. However, experimental studies have not confirmed this view. For example, Sergeant (1969) and Siegel (1972) observed no difference between AP possessors and nonpossessors in their performance on frequency discrimination tasks. Fujisaki and Kashino (2002) confirmed the lack of difference between AP possessors and nonpossessors in frequency discrimination, and also found no difference between these two groups in the detection of tones in the presence of notched noise, in temporal gap discrimination, or in spatial resolution. On the other hand, AP possessors have been found to differ from nonpossessors in higher-level pitch processing, generally in advantageous ways. They exhibit categorical perception in note naming, while still discriminating between pitches within categories; they perform better on certain pitch memory tasks, on certain tasks involving the phonological processing of speech, and (except under unusual circumstances) in judging pitch relationships.

A. Categorical Perception of Pitch


AP possessors automatically encode pitches into categories that correspond to note names, and such categorical perception has been explored in several experiments. For example, Siegel and Siegel (1977) presented AP possessors with tones whose pitches varied in 20-cent increments, and found that identification judgments reflected categorical perception in semitone steps. Miyazaki (1988) obtained similar findings, which are illustrated in the judgments of one AP possessor shown in Figure 6. However, more complex results have also been obtained. Burns and Campbell (1994) tested AP possessors on a pitch identification task employing tones that varied in 25-cent increments. The results varied across subjects; for example, the judgments of one subject showed consistent categorization in semitone steps, whereas those of another subject reflected the use of 25-cent categories.
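As an informal illustration of how such stimuli can be specified (a sketch of our own, not code taken from the studies cited), a frequency can be converted to the nearest equal-tempered note name together with its deviation in cents; the 20- or 25-cent mistunings used in these experiments are then simply fractional deviations from the nearest category.

# Illustrative sketch (Python): nearest equal-tempered note name plus deviation in cents.
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def name_with_cents(freq_hz, a4_hz=440.0):
    """Return the nearest note (e.g., 'A4') and the deviation from it in cents."""
    midi = 69 + 12 * math.log2(freq_hz / a4_hz)  # fractional MIDI note number
    nearest = round(midi)
    cents = (midi - nearest) * 100               # deviation from the nearest category
    return f"{NOTE_NAMES[nearest % 12]}{nearest // 12 - 1}", round(cents, 1)

print(name_with_cents(440.0))  # ('A4', 0.0)
print(name_with_cents(445.0))  # ('A4', 19.6) -- roughly 20 cents sharp of A4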



Figure 6 Distribution of note naming responses by a single absolute pitch possessor, indicating categorical perception. From Miyazaki (1988), with kind permission from Springer Science and Business Media.

Both Miyazaki (1988) and Burns and Campbell (1994) observed that, in contrast to categorical perception of speech sounds, for which discrimination functions are related to identification functions (Macmillan, Goldberg, & Braida, 1988), AP possessors discriminated between tones within categories while nevertheless exhibiting categorical perception in pitch identification tasks.

B. Pitch Memory
The ability of AP possessors to categorize and encode pitches in verbal form confers a considerable advantage for pitch memory. In an early experiment, Bachem (1954) compared the performance of AP possessors and musically trained nonpossessors on a pitch memory task. A standard tone was presented, followed by a comparison tone, and the subjects indicated whether the tones were the same or different in pitch. The two groups showed roughly the same decay rate of pitch memory during the first minute. However, at longer retention intervals, the performance of the nonpossessors continued to deteriorate, while that of the AP possessors remained stable, presumably because they were encoding the pitches in the form of verbal labels. Indeed, when AP possessors were able to label the tones to be remembered, they performed accurately with retention intervals as long as 1 week.

In a further study, Rakowski and Rogowski (2007) had subjects listen to a standard tone, and then tune a variable tone to match the pitch of the standard. When silent intervals of up to 1 minute were interposed between the tones, two AP possessors and a control nonpossessor exhibited very similar performance. However, beyond this period, the performance of the nonpossessor deteriorated with time, whereas that of the AP possessors remained more stable.

In a more elaborate experiment, Siegel (1974) used a paradigm similar to that of Deutsch (1970). Subjects were presented with a test tone that was followed by a sequence of intervening tones and then by a second test tone, and they judged whether the test tones were the same or different in pitch. When the difference between the tones to be compared was 1/10 semitone, the performance of the AP possessors and nonpossessors declined at roughly the same rate over a 5-sec retention interval. However, when this difference was 1 semitone, the performance of the two groups diverged substantially: that of the AP possessors remained stable at a high level, while that of the nonpossessors deteriorated sharply over a 15-sec retention interval.


These results indicated that the raw memory trace characteristics of the two groups were similar, but that because the AP possessors adopted a verbal encoding strategy, they were able to draw on long-term memory in making their judgments when the pitch difference between the tones to be compared was roughly a semitone.

Following up on these findings, Ross and Marks (2009) suggested that children with minimal musical training who nevertheless show excellent short-term memory for pitch might be categorizing pitches in some way, and so might later develop AP as conventionally defined. The authors provided some preliminary evidence in favor of this hypothesis, and their intriguing suggestion awaits further investigation.

C. Octave Errors
While the performance of AP possessors and nonpossessors in judging the octave placement of tones has not yet been compared, a number of studies have shown that AP possessors sometimes make errors in judging octave placement, while correctly identifying the note names (Bachem, 1955; Lockhead & Byrd, 1981; Miyazaki, 1989). However, octave errors are difficult to interpret. In contrast to the standard terminology for designating pitch classes (C, C♯, D, and so on), there is no standard terminology for designating octaves. Subjects might therefore be unfamiliar with the octave terminology employed in any given experiment, and this could lead to artifactual errors. As another point, tones that are built on the same fundamental but played on different instruments (such as piano and harpsichord) can differ in perceived height, and so in perceived octave. In relation to this, the perceived height of a tone can also be made to differ substantially by manipulating the relative amplitudes of its odd and even harmonics (Deutsch, Dooley, & Henthorn, 2008; Patterson, 1990; Patterson, Milroy, & Allerhand, 1993). The octave designation of a tone of unfamiliar timbre can therefore be problematic in principle.

D. Processing of Relative Pitch


AP possessors often feel uncomfortable when faced with arbitrarily transposed music, or when viewing a written score while simultaneously hearing the music played in a different key. This feeling of discomfort is understandable, because such listeners find the discrepancy between the notes they are viewing and hearing to be very salient. However, AP nonpossessors, who are often unaware of small overall pitch discrepancies, or at least regard them as fairly unimportant, sometimes find such a reaction puzzling, and may ascribe it to some cognitive or emotional problem. Indeed, because this reaction is often regarded as a sign of perceptual rigidity, several researchers have claimed that AP possession confers a disadvantage to relative pitch processing, and even to musicianship in general (cf. Miyazaki, 2004). Given that many world-class musicians are AP possessors, this claim appears highly implausible at face value; however, the evidence for and against it is reviewed here.


Ward and Burns (1982) conjectured that the tendency for AP possessors to perceive pitches categorically might place them at a disadvantage in performing certain relative pitch tasks. Suppose, for example, that a listener were presented with C4 + 40 cents, followed by D♯4 − 40 cents. This produces an interval of 220 cents, and so should be recognized as a major second. However, an AP possessor might hypothetically perceive both the C and the D♯ categorically, and so identify the interval as a minor third instead. This conjecture was evaluated by Benguerel and Westdal (1991), who found that only 1 out of 10 AP possessors made errors in interval identification on this basis, and even then did not do so consistently. However, Miyazaki (1992) found that a minority of AP possessors made more errors in identifying detuned intervals when the first tone comprising the interval deviated from equal tempered tuning, so indicating a small effect in this direction.

Miyazaki (1993, 1995) further argued that AP possessors who were trained on a fixed-do system are subject to another source of error in making relative pitch judgments. He had subjects name intervals produced by tone pairs that were each preceded by a key-defining context (C, F♯, or a detuned E) created by a V7–I cadence, with the first note of the pair being the tonic defined by the chord. The performance of AP possessors was degraded in the F♯ and detuned E contexts relative to the C context, and Miyazaki concluded that this was due to the influence on their judgments of a strong fixed-do template that was centered on C.

However, the task employed by Miyazaki (1993, 1995) was an unusual one. The subjects, who had been trained in the fixed-do system, were required to designate the intervals using solfège names (do, re, mi, etc.) relative to C. For example, in this experiment the correct answer for the interval F-A (a major third) was mi; however, the subjects had also been taught to use the label mi to designate the note E. Therefore for key contexts other than C, the subjects were for the most part required to designate an interval by using a note name (do, re, mi, etc.) that differed from that of either of the presented notes.

The unusual requirement to use solfège names to label intervals therefore produced a Stroop-like situation, so that AP possessors would be expected to experience confusion in performing this task. It was originally found by Stroop (1935) that when subjects were presented with the printed names of colors, their naming performance was impaired when there was a mismatch between the printed name and the color in which it was printed. An analogous effect was demonstrated by Zakay, Roziner, and Ben-Arzi (1984), who required AP possessors to identify the pitches of sung syllables, and found that their performance deteriorated when the syllables corresponded to the names of mismatched pitches. In a variant of this paradigm, Miyazaki (2004) reported that when a mismatch occurred between a syllable and the pitch in which it was sung, the pitch interfered with syllable naming for AP possessors. However, AP nonpossessors, who would not have engaged in pitch naming in the first place, were not influenced by such a mismatch (see also Itoh et al., 2005).

Hsieh and Saberi (2008) provided further evidence confirming the involvement of a Stroop effect in judgments made by fixed-do trained subjects.


These authors presented hybrid stimuli consisting of pitches that were voiced with solfège syllables. Subjects who had received fixed-do training (such as those studied by Miyazaki) showed substantial interference in pitch naming when the pitches and syllables were mismatched, whereas those who had received moveable-do training showed no such interference.

A further study on the issue of relative pitch processing by AP possessors was prompted by the general impression that such individuals often feel uncomfortable when viewing a written score while hearing the music played in a different key. Miyazaki and Rakowski (2002) carried out an experiment to determine whether the performance of AP possessors might be degraded by a conflict between mismatched auditory and visual stimuli. Subjects were presented with a standard melody in a written score, together with an aurally presented comparison melody. On some trials, the comparison melody was at the same pitch level as the standard, while on other trials it was transposed up or down. Further, on some trials, the pitch relationships formed by the standard and comparison melodies were identical, and on other trials they differed, and subjects judged whether the melodies were the same or different. When the auditory and visual sequences were matched, the AP possessors outperformed the nonpossessors on this task; however, when the auditory sequences were transposed relative to the visual ones, the advantage to the AP possessors disappeared. In this latter condition, there was a marginal advantage to the AP nonpossessors, although this advantage became nonsignificant when the data of one anomalous borderline AP possessor were omitted. Yet the performance of the AP nonpossessors did not differ depending on whether the visually and aurally presented melodies were transposed relative to each other. Perhaps the AP possessors translated the visually presented notes into clearly imagined sounds, and this produced a conflict when they compared them with the transposed aurally presented melodies, whereas the nonpossessors viewed the written score in a more abstract fashion, so that no such conflict occurred. However, since the performance difference between the AP possessors and nonpossessors was only marginally significant, this issue awaits further investigation.

Given these findings, Miyazaki (2004) speculated more generally that AP possessors might have a general impairment in relative pitch processing, and even that AP "may be a disadvantage for musicianship" (p. 428). However, because these experiments were such as to engender Stroop-type conflicts on the part of AP possessors, the question arises as to how such listeners would perform under more standard, and ecologically valid, conditions. Dooley and Deutsch (2010) addressed this question using a musical dictation task that was modeled after one used in the placement examination administered to first-year students in the University of Southern California Thornton School of Music. Thirty musically trained subjects were divided into three groups (AP possessors, borderline possessors, and nonpossessors) based on their performance on the AP test used by Deutsch et al. (2006) and Deutsch, Dooley, et al. (2009). All subjects were given a musical dictation task that consisted of three passages that they transcribed in musical notation.


The starting note was furnished for each passage in order to provide a reference. There was a strong positive relationship between performance on the AP test and the musical dictation tasks, and neither age of onset of musical training nor years of training was significantly related to the dictation scores. The performance level was significantly higher for the AP possessors than for the nonpossessors, for the AP possessors than for the borderline possessors, and for the borderline possessors than for the nonpossessors.

In a further study, Dooley and Deutsch (2011) tested musically trained subjects consisting of 18 AP possessors and 18 nonpossessors, with the two groups matched for age and for age of onset and duration of musical training. The subjects performed interval-naming tasks that required only relative pitch. In contrast to the studies by Miyazaki (1993, 1995), the intervals were to be identified by their interval names (major second, minor third, and so on), so that no conflict was produced between the names that were used to designate the intervals and those of the notes forming the intervals. In one condition, the intervals were formed of brief sine waves that were just of sufficient duration to provide a clear sense of pitch (Hsieh & Saberi, 2007). In a second condition, piano tones were employed. A third condition was identical to the second, except that each interval was preceded by a V7–I cadence such that the first tone of the pair would be interpreted as the tonic.

Figure 7 shows, for each subject, the overall percentage correct in the interval-naming task. As can be seen, AP possession was strongly and positively correlated with interval identification performance. Further, the advantage to AP possession held under all conditions of interval presentation. It is of particular interest that the AP advantage was not erased by providing the interval to be named with a tonal context.
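For readers who want the interval arithmetic spelled out, the sketch below (our own illustration, assuming equal temperament and assignment to the nearest whole number of semitones) shows both ordinary interval naming and the categorical mislabeling that Ward and Burns (1982) conjectured for mistuned tones.

# Illustrative sketch (Python): naming an interval from its size in cents.
INTERVAL_NAMES = {0: "unison", 1: "minor second", 2: "major second", 3: "minor third",
                  4: "major third", 5: "perfect fourth", 6: "tritone", 7: "perfect fifth",
                  8: "minor sixth", 9: "major sixth", 10: "minor seventh",
                  11: "major seventh", 12: "octave"}

def interval_name(cents):
    """Name of the interval whose size is nearest to a whole number of semitones."""
    return INTERVAL_NAMES[round(cents / 100)]

# Ward and Burns's example: C4 raised by 40 cents, then D#4 lowered by 40 cents.
# C4 to D#4 spans 300 cents; the two mistunings shrink the interval by 80 cents.
print(interval_name(300 - 40 - 40))  # major second (220 cents)

# A listener who first snaps each tone to its nearest note category (C and D#)
# would instead evaluate the categorical distance of 300 cents:
print(interval_name(300))            # minor third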

Figure 7 Overall percentage correct on three interval naming tasks, plotted against percentage correct on a test for absolute pitch. A strong correlation emerged between absolute pitch possession and enhanced performance on the interval naming tasks. Adapted from Dooley and Deutsch (2011).


So together with the findings of Dooley and Deutsch (2010) on musical dictation tasks, the findings from this experiment indicate that AP possession is strongly associated with enhanced performance on musical tasks requiring only relative pitch, given standard musical situations.

VI. Neuroanatomical Substrates of AP

A considerable body of evidence has accumulated showing that AP is associated with unique brain circuitry, and this has implicated regions that are known to be involved in pitch perception and categorization, memory, and speech processing. The studies have involved both structural and functional neuroimaging (Bermudez & Zatorre, 2009b; Keenan et al., 2001; Loui et al., 2011; Oechslin et al., 2010; Ohnishi et al., 2001; Schlaug et al., 1995; Schulze et al., 2009; Wilson et al., 2009; Zatorre, 2003; Zatorre et al., 1998), and the obtained findings presumably reflect both innate factors and environmental influences that operate during an early critical period.

One region that has been particularly implicated in AP is the left planum temporale (PT), a region in the temporal lobe that corresponds to the core of Wernicke's area and that is essential to speech and language. The PT has been shown to be leftward asymmetric in most human brains (Geschwind & Levitsky, 1968), and in a seminal study, Schlaug et al. (1995) found that this asymmetry was exaggerated among AP possessors. Later, Zatorre et al. (1998) observed that the PT was larger in the left hemisphere among AP possessors than in a control group of subjects who were unselected for musical skill. Keenan et al. (2001) confirmed the exaggerated leftward asymmetry among AP possessors; however, in their study, this asymmetry was predominantly driven by a smaller right PT rather than a larger left one. Keenan et al. also found that the exaggerated leftward PT asymmetry did not occur in a control group of AP nonpossessors who had begun musical training at an early age. Wilson et al. (2009) confirmed Keenan's findings and also reported that borderline AP possessors did not show the same exaggerated asymmetry, a finding consistent with the conjecture that this group should be considered neurologically distinct from high-performing AP possessors. In line with the structural findings, Ohnishi et al. (2001) observed that AP possessors showed enhanced activation in the left PT during passive listening to music, and Oechslin et al. (2010) found that AP possessors showed enhanced activation in the left PT and surrounding regions while performing a segmental speech processing task. Leftward asymmetry of the PT has been observed in the human fetus (Wada, Clarke, & Hamm, 1975), so these findings can be taken to argue for a genetic, or at least innate, component to the predisposition to acquire AP.

Another region that has been implicated in AP is the left posterior dorsolateral frontal cortex. Zatorre et al. (1998) found that AP possessors showed enhanced activation in this region when covertly naming single tones, while nonpossessors showed activation in the same region when judging musical intervals.


Taking into consideration other findings showing that this region is implicated in conditional associative learning (Petrides, 1985, 1990), Zatorre et al. hypothesized that AP possessors involve this region in the retrieval of associations between pitch values and their verbal labels (see also Bermudez & Zatorre, 2005). In line with these findings, Ohnishi et al. (2001) observed enhanced activation in the left posterior dorsolateral frontal cortex during a passive music listening task, and this correlated with high performance on an AP test.

Further differences between AP possessors and nonpossessors have been found by Schulze et al. (2009), employing a short-term pitch memory task similar to that developed by Deutsch (1970, 1975). In general, these authors found enhanced temporal lobe activity in both groups during the first 3 seconds following stimulus presentation, presumably reflecting stimulus encoding. They also found continued strong activity in the frontal and parietal cortex during the next 3 seconds, presumably reflecting activity in the working memory system. AP possessors showed greater activity in the left superior temporal sulcus during the early encoding phase, whereas the nonpossessors showed greater activity in right parietal areas during both phases. The authors hypothesized that brain activation among AP possessors during the early encoding phase involved the categorization of tones into pitch classes, with the result that they were able to place less reliance on working memory in making their judgments. In line with this reasoning, Wilson et al. (2009) found that borderline AP possessors recruited a more extensive neural network in performing a pitch naming task than did high-performing AP possessors, with the latter group instead showing activation particularly in the left posterior superior temporal gyrus.

The ability of AP possessors to place less reliance on working memory for pitch, owing to their enhanced ability to encode pitches in verbal form, could also account for their showing an absent or smaller P300 component of event-related potentials while performing pitch memory tasks (Hantz, Kreilick, Braveman, & Swartz, 1995; Hirose, Kubota, Kimura, Ohsawa, Yumoto, & Sakakihara, 2002; Klein, Coles, & Donchin, 1984; Wayman, Frisina, Walton, Hantz, & Crummer, 1992). This highlights the importance to AP of brain regions subserving the pitch categorization discussed in Section V,A (Rakowski, 1993; Siegel, 1974; Siegel & Siegel, 1977). Interestingly, other studies have also associated the left superior temporal sulcus with sound identification and categorization (Liebenthal, Binder, Spitzer, Possing, & Medler, 2005; Möttönen et al., 2006).

An intriguing recent development concerns the role of connectivity between brain regions that are critically involved in AP. Loui et al. (2011), using diffusion tensor imaging and tractography, found that AP possession was associated with hyperconnectivity in bilateral superior temporal lobe structures. Specifically, they found that tract volumes connecting the posterior superior temporal gyrus and the posterior medial temporal gyrus were larger among AP possessors than among nonpossessors. These differences in tract volumes were particularly strong in the left hemisphere and survived control for onset and duration of musical training. When AP possessors were grouped into those with very high scores and those with lower scores, it was found that the more accurate AP possessors also had larger tract volumes in the left hemisphere.


VII. AP Accuracy and Stimulus Characteristics

Among AP possessors, accuracy of note naming varies with the characteristics of the tones to be judged. Here we discuss effects of pitch class (including the advantage of white-key notes over black-key notes), the effect of the octave in which the tone is placed, and the effect of timbre.

A. Pitch Class
AP possessors vary in the speed and accuracy with which they identify different pitch classes. In general, pitches that correspond to white keys on the keyboard (C, D, E, F, G, A, B) are identified more accurately and rapidly than those that correspond to black keys (C♯/D♭, D♯/E♭, F♯/G♭, G♯/A♭, A♯/B♭) (Athos et al., 2007; Baird, 1917; Bermudez & Zatorre, 2009a; Carroll, 1975; Deutsch et al., 2011; Marvin & Brinkman, 2000; Miyazaki, 1988, 1989, 1990; Sergeant, 1969; Takeuchi & Hulse, 1991, 1993).

Two main explanations have been suggested for the black/white key effect. Miyazaki (1989, 1990) argued that most AP possessors begin musical training on the piano during the critical period for AP acquisition, and that such training typically commences with simple five-finger patterns using only white keys, with black keys being gradually introduced as training proceeds. He therefore proposed that the white-key advantage for AP judgments results from piano practice with these notes during early childhood. In support of this argument, Miyazaki and Ogawa (2006) performed a cross-sectional study on children aged 4 to 10 who were taking keyboard lessons, and found that, overall, the children acquired the ability to name the pitches of notes in the order of their appearance in the lessons.

The hypothesis that the white-key advantage is due to early training on the piano was evaluated in the study by Deutsch et al. (2011). Here comparison was made between two groups of instrumentalists who began musical training at or before age 9. One group had begun training on the piano, and piano was currently their primary instrument; the other group had begun training on a non-keyboard instrument such as the violin, and they currently played a non-keyboard instrument. As shown in Figure 8, both groups showed a clear black/white key effect, and this was, if anything, stronger among those who were not keyboard performers. These findings argue that the black/white key effect cannot be attributed to early training on the white notes of the piano.

Another explanation for the black/white key effect was advanced by Takeuchi and Hulse (1991). These authors pointed out that, based on general observation, in Western tonal music white-key pitches occur more frequently than black-key pitches, and so should be better processed. This explanation in terms of frequency of occurrence is in line with findings showing that in other tasks, such as lexical decision making and word naming, responses are faster and more accurate to frequently occurring words than to words that occur less frequently (Besner & McCann, 1987). In accordance with this hypothesis, Simpson and Huron (1994) determined the


[Figure 8 (bar graph): Percentage Correct, from 60 to 100, for black-key notes and white-key notes, shown separately for Pianists and Orchestral Performers.]

Figure 8 Average percentage correct on a test of absolute pitch among students in a large-scale study at the Shanghai Conservatory of Music, plotted separately for white-key and black-key pitches. Data from Deutsch, Le, et al. (2011).

frequency of occurrence of the different pitch classes from a sample of works by Bach and Haydn, and found that this distribution correlated significantly with the distribution of reaction times obtained by Miyazaki (1989) from seven AP possessors. Huron (2006) proposed, in agreement with Takeuchi and Hulse, that the prevalence of AP for the different pitch classes might differ in association with their frequency of occurrence in the music to which the listener has been exposed. In a large-scale analysis, Deutsch et al. (2011) plotted the percentage correct identifications of each pitch class, taking all 135 subjects in the study who had begun musical training at or before age nine. We correlated these percentages with the number of occurrences of each pitch class in Barlow and Morgenstern's Electronic Dictionary of Musical Themes (2008), data kindly furnished to us by David Huron. As shown in Figure 9, there was a highly significant correlation between note-naming accuracy and frequency of occurrence of the different pitch classes in this representative note collection. The result is particularly striking considering that the repertoire used in classes at the Shanghai Conservatory of Music, although having its primary input from Western tonal music, also has a larger input from Russian and Chinese music than occurs in Western music conservatories.

Another approach to the effect of pitch class was advanced by Athos et al. (2007), in the Web-based study described earlier. They observed an overall tendency for subjects to misidentify notes as a semitone sharp (for example, to misidentify the note D♯ as E). In particular, the note G♯ was frequently misidentified as A. Based on the latter finding, the authors proposed that since Concert A is used as the reference for orchestra tuning, pitch class A might serve as a perceptual magnet (Kuhl, 1991), so enlarging the perceptual region assumed by listeners to correspond to this note. However, according to their hypothesis, one would expect the note A to be


[Figure 9 (scatter plot): Note Named Correctly (%), from 60 to 90, plotted against Note Count from Classical Repertoire (10,000 to 25,000) for each pitch class, with a fitted trend (R² = 0.6489).]

Figure 9 Average percentage correct on a test of absolute pitch among students in a large-scale study at the Shanghai Conservatory of Music, plotted for each pitch class separately, and against the number of occurrences of each pitch class in Barlow and Morgenstern's Electronic Dictionary of Musical Themes (2008). From Deutsch, Le, et al. (2011).

most frequently identified correctly, yet Athos et al. did not obtain this finding. It appears, therefore, that the tendency to misidentify G♯ as A can best be ascribed to the general tendency to misidentify notes in the sharp direction. In a further investigation of this issue, Deutsch et al. (2011) confirmed the general tendency to misidentify notes as a semitone sharp; however, no special status for the note A was found. Specifically, the probability of misidentifying G♯ as A was 7.9%, and of misidentifying G♯ as G was 6.17%. However, the probability of misidentifying A♯ as A was only 3.21%, whereas the probability of misidentifying A♯ as B was 12.59%. So the findings from this study run counter to the hypothesis that the note A acts as a perceptual magnet.

As a related issue, many musicians claim that they can identify a single reference pitch with ease, for example, Concert A in the case of violinists, and Middle C in the case of pianists (Bachem, 1955; Baggaley, 1974; Baird, 1917; Balzano, 1984; Revesz, 1953; Seashore, 1940; Takeuchi & Hulse, 1993). However, formal testing with notes presented in random order has not confirmed this view (Takeuchi, 1989; Deutsch et al., 2011), so this informal impression might have been obtained from judgments made in particular musical settings. The conditions under which AP nonpossessors might identify a reference pitch with accuracy remain to be identified.
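To make the frequency-of-occurrence account behind Figure 9 concrete, the sketch below computes a correlation of the kind reported there. It is illustrative only: the accuracy values and note counts are invented placeholders, not the published data of Deutsch et al. (2011) or the counts from Barlow and Morgenstern's dictionary.

```python
# Minimal sketch: correlating note-naming accuracy with pitch-class frequency.
# All numerical values below are hypothetical placeholders for illustration.
import numpy as np

pitch_classes   = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
percent_correct = np.array([83, 70, 78, 68, 85, 82, 72, 80, 66, 84, 69, 81])   # hypothetical
note_counts     = np.array([21000, 6000, 17000, 7000, 18500, 15000, 8000,
                            19000, 6500, 20000, 7500, 16000])                  # hypothetical

r = np.corrcoef(note_counts, percent_correct)[0, 1]   # Pearson correlation
print(f"r = {r:.2f}, R^2 = {r**2:.2f}")
```

Squaring the Pearson correlation gives the proportion of variance in naming accuracy accounted for by note frequency, the quantity reported as R² in Figure 9.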


B. Octave Placement
A number of studies have shown that AP possessors name notes most accurately when they are in central pitch registers (Bachem, 1948; Baird, 1917; Miyazaki, 1989; Rakowski, 1978; Rakowski & Morawska-Bungeler, 1987). It is to be expected that note-naming accuracy would be reduced at the high and low extremes of the musical range, because the musical aspect of pitch is here lost (Burns, 1999; Lockhead & Byrd, 1981; Pressnitzer, Patterson, & Krumbholz, 2001; Semal & Demany, 1990; Takeuchi & Hulse, 1993). However, note-naming accuracy has been found to vary depending on register in the middle of the musical range also. Miyazaki (1989) presented notes that ranged over seven octaves and found that best performance occurred for notes between C4 and C6, with performance declining on both sides of this range, and declining more steeply on the lower side, as shown in Figure 10. A similar result was obtained by Deutsch et al. (2011) considering only notes in the middle three octaves (C3-B5). Performance at the lower octave was here significantly worse than at the middle or higher octave, while the difference between the middle and higher octaves was not significant. On general grounds, the effect of register might relate to the frequency of occurrence of the different notes in Western music, though this conjecture awaits formal investigation.
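For readers who want to connect these octave designations to physical frequencies, here is a minimal sketch of the frequency-to-note-name conversion assumed throughout this section (twelve-tone equal temperament with A4 = 440 Hz). The function name and the rounding to the nearest semitone are illustrative choices of this sketch, not part of the studies described above.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def note_name(freq_hz: float) -> str:
    """Map a frequency to its nearest equal-tempered note name and octave,
    taking A4 = 440 Hz (MIDI note number 69)."""
    midi = round(69 + 12 * math.log2(freq_hz / 440.0))
    return f"{NOTE_NAMES[midi % 12]}{midi // 12 - 1}"

print(note_name(440.0))    # A4
print(note_name(261.63))   # C4 (middle C)
print(note_name(1975.5))   # B6
```

Under this convention, middle C is C4 (about 262 Hz), so the range C4 to C6 in which Miyazaki's listeners performed best spans roughly 262 Hz to 1047 Hz.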

C. Timbre
Although some AP possessors name pitches accurately regardless of how they are produced (for example, by car horns, vacuum cleaners, air conditioners, and so on), others are accurate only for one or two instrument timbres with which they are familiar. Piano timbres appear to be particularly conducive to high levels of note naming (Athos et al., 2007; Baharloo et al., 1998; Lockhead &
[Figure 10 (line graph): Percent Correct, from 20 to 100, as a function of Octave Position (1 to 7), with separate curves for piano tones, complex tones, and pure tones.]

Figure 10 Average percentage correct on a test of absolute pitch as a function of octave placement and instrument timbre. 1 = C1-B1; 2 = C2-B2; 3 = C3-B3; 4 = C4-B4; 5 = C5-B5; 6 = C6-B6; 7 = C7-B7. From Miyazaki (1989). © 1989 Regents of the University of California.


Byrd, 1981; Rakowski & Morawska-Bungeler, 1987; Takeuchi & Hulse, 1993; Ward, 1999). For example, Lockhead and Byrd (1981) found that listeners who scored 99% correct on piano tones scored only 58% correct (69% discounting octave errors) on pure tones. Miyazaki (1989) had seven AP possessors identify pure tones, complex pianolike tones, and piano tones. As shown in Figure 10, performance was most accurate for piano tones, less accurate for pianolike tones, and least accurate for pure tones. Further, in a large-scale study, Lee and Lee (2010) examined accuracy of note identification for synthesized piano, viola, and pure tones. They found a strong effect of timbre, with accuracy being highest for piano tones, lower for viola tones, and lowest for pure tones. Sergeant (1969) demonstrated a more general involvement of timbre in AP. He recorded tones from a number of different instruments and spliced out their initial portions, so rendering their timbres unfamiliar. Pitch identification suffered for the truncated tones, and Sergeant argued that the important factor here was not the pattern of harmonics, but rather overall familiarity with perceived sound quality. AP decisions therefore do not only involve the processing of pitch values, but are derived from evaluating the note as a whole, taken as a bundle of attribute values. This argument is in line with the conjecture that AP originally evolved to subserve speech sounds, which occur as bundles of features, such as consonants and vowels.

VIII. Pitch Shifts in AP Possessors


Although AP nonpossessors are able to detect pitch shifts of individual tones or groups of tones, with rare exceptions only AP possessors notice a shift of the entire tuning of the hearing mechanism. In particular, two sources of pitch shift have been identified: those occurring with advancing age and those associated with medication. These pitch shifts may well occur in the general population also, though AP nonpossessors might not be sensitive to them.

A. Association with Advancing Age


Beginning as early as age 40 to 50, AP possessors generally find that pitches appear to be slightly sharper or flatter than they had been. People who have described such pitch shifts include J. F. Beck, who noticed at age 40 that he was beginning to hear notes a semitone sharp; this pitch shift progressed to two semitones at age 58, and to three semitones at age 71 (Ward, 1999). Also, P. E. Vernon (1977) observed that at age 52 he heard music a semitone too sharp, and at age 71 two semitones too sharp. On the other hand, some AP possessors have noted that pitches appear flattened instead, and yet others do not appear to experience a pitch shift with age (Carpenter, 1951).


Athos et al. (2007), in their Web-based study, found that errors in pitch naming tended to increase with age, so that no subject in their study over 51 years of age identified all the tones in their test correctly. Such pitch shifts tended to be on the sharp side, though not consistently so. Athos et al. hypothesized that these pitch shifts could result from changes in the mechanical properties of the cochlea, though at present the physiological basis of this effect is unknown.

B. Association with Medication


Concerning pitch shifts resulting from medication, carbamazepine, a drug that is widely used for the treatment of epilepsy and other disorders, has been the subject of particular interest. A number of studies have shown that this drug produces a downward pitch shift of roughly a semitone, though fortunately the effect disappears rapidly when the drug is discontinued (Chaloupka, Mitchell, & Muirhead, 1994; Fujimoto, Enomoto, Takano, & Nose, 2004; Konno, Yamazaki, Kudo, Abe, & Tohgi, 2003; Tateno, Sawada, Takahashi, & Hujiwara, 2006; Yoshikawa & Abe, 2003). AP nonpossessors who have taken carbamazepine sometimes state that the drug causes pitches to appear abnormal, and a few nonpossessors have been able to pinpoint the direction of the pitch shift as downward. In contrast, AP possessors can document the pitch shift with confidence; indeed, they often find the effect disconcerting, with one patient reporting that it produced an unbearable sense of incongruity (Konno et al., 2003).

Braun and Chaloupka (2005) carried out a detailed examination of the pitch shift under carbamazepine in a concert pianist. In a double-blind study involving all tones within a six-octave range, the subject shifted a mouse bar on a computer screen so as to match the visual representations of the presented tones with their perceived pitches in a fine-grained fashion. As shown in Figure 11, carbamazepine produced a downward pitch shift relative to placebo that was on average a little less than a semitone, with the extent of the shift increasing systematically from the lower to higher octaves. As another interesting finding, the black/white key effect persisted under carbamazepine. This applied to the pitches as they were perceived rather than to the tones as they were presented, indicating that the carbamazepine-induced effect occurs at a stage peripheral to that involved in the black/white key effect. Other than this, the neural basis of this pitch shift is unknown, though it has been hypothesized to be peripheral in origin (Braun & Chaloupka, 2005; Yoshikawa & Abe, 2003).
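Since Figure 11 reports the shift in cents, a brief sketch of the standard cents-to-frequency conversion may help in reading it. The specific numerical values below are illustrative, chosen to be in the general range of the effect described above, and are not data taken from Braun and Chaloupka (2005).

```python
def shift_frequency(freq_hz: float, cents: float) -> float:
    """Apply a pitch shift expressed in cents (100 cents = one equal-tempered
    semitone); negative values shift the pitch downward."""
    return freq_hz * 2 ** (cents / 1200.0)

# An illustrative downward shift of 80 cents moves Concert A (440 Hz) to
# roughly 420 Hz, clearly flat of A but still sharp of G#.
print(round(shift_frequency(440.0, -80), 1))    # ~420.1 Hz
print(round(shift_frequency(440.0, -100), 1))   # a full semitone down, ~415.3 Hz
```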

IX. AP in Special Populations

The prevalence of AP is unusually high in certain rare populations. Interestingly, AP within these populations is associated with patterns of brain activation in response to sounds that differ from the patterns found among AP possessors within the general population.


[Figure 11 (line graph): Median Deviation (cents) as a function of octave (C1-C2 through C6-C7), with separate curves for the Placebo and Carbamazepine conditions.]

Figure 11 Pitch shift induced by carbamazepine. The data show, for a single subject, the extent of the downward pitch shift induced by carbamazepine relative to placebo, as a function of the octave of the presented tone. Adapted from Braun and Chaloupka (2005).

AP is highly prevalent among blind musicians, both those who are congenitally blind and those who have lost their vision very early in life (Bachem, 1940; Gaab, Schulze, Ozdemir, & Schlaug, 2006; Hamilton, Pascual-Leone, & Schlaug, 2004; Welch, 1988). For example, Hamilton et al. (2004) found that of 21 early blind subjects who were musically trained, 57% were AP possessors, some of whom had even begun taking music lessons in late childhood. The early blind, as a group, are also superior to sighted individuals in judging direction of pitch change, and in localizing sounds (Gougoux, Lepore, Lassonde, Voss, Zatorre, & Belin, 2004; Roder et al., 1999; Yabe & Kaga, 2005). It therefore appears that the high prevalence of AP in this group reflects a general shift in emphasis of brain resources from the visual to the auditory domain. Concerning neurological underpinnings, blind AP possessors have been found to produce more activation in nonauditory areas, such as visual and parietal areas, in performing pitch memory tasks (Ross, Olson, & Gore, 2003; Gaab et al., 2006). In addition, Hamilton et al. (2004) observed a greater variability in PT asymmetry in early blind compared with sighted AP possessors.

There is also evidence that AP is more prevalent among autistic individuals. Autism is a rare neurodevelopmental disorder characterized by intellectual and communicative deficits that occur in combination with islands of specific enhanced abilities. Extreme forms of this syndrome exist in autistic savants, who show extraordinary discrepancies between general cognitive impairments and spectacular achievements in specific domains. Their prodigious talents are often musical. AP is highly prevalent among musical savants in association with other exceptional


musical abilities, for example in composing, performing, improvising, and remembering large segments of music following very little exposure (Mottron, Peretz, Belleville, & Rouleau, 1999; Miller, 1989; Young & Nettlebeck, 1995). Nonsavant autistic individuals often display a particular interest in music (Kanner, 1943; Rimland & Hill, 1984) and show substantially enhanced discrimination, categorization, and memory for the pitches of musical tones (Bonnel et al., 2003; Heaton, 2003, 2005, 2009; Heaton, Hermelin, & Pring, 1998) and speech samples (Järvinen-Pasley, Wallace, Ramus, Happe, & Heaton, 2008). It has been suggested that the superior categorization of sounds found in autistic individuals who lack musical training could indicate a predisposition to acquire AP (Heaton et al., 1998). As a caution, however, Heaton, Williams, Cummins, and Happe (2008) have pointed out that autistic persons who achieve discrepantly high scores on musical tasks might represent a specialized subgroup within the autistic population.

With respect to neurological underpinnings, although abnormal PT volumes occur in autistic persons, this pattern of asymmetry is quite unlike that in normal AP possessors (Rojas, Bawn, Benkers, Reite, & Rogers, 2002; Rojas, Camou, Reite, & Rogers, 2005). Rojas et al. (2002), in a magnetic resonance imaging study, found that PT volume was significantly reduced in the left hemisphere among a group of autistic adults compared with normal controls. However, the two groups showed no difference in the right hemisphere, so that the autistic group essentially exhibited symmetry of the left and right PT. Later, Rojas et al. (2005) confirmed this pattern in autistic children.

An enhanced prevalence of AP has also been hypothesized to exist among persons with Williams syndrome. This is a rare neurodevelopmental disorder of genetic origin, characterized by mild to moderate intellectual deficits and distinctive facial features, together with other physiological abnormalities. Lenhoff, Perales, and Hickok (2001) found in an exploratory study that five individuals with Williams syndrome possessed AP, and they argued that this number was higher than might be expected; however, the relative incidence of AP among persons with Williams syndrome is at present unknown.

X. Conclusion

Absolute pitch is an intriguing phenomenon that has long been the subject of considerable speculation and has recently drawn interest from researchers in a wide variety of disciplines, including music, psychology, neuroscience, and genetics. Although it had been considered an encapsulated ability, its study has contributed to the understanding of many issues, including critical periods in perceptual and cognitive development, relationships between language and music, the influence of language on perception, neuroanatomical correlates of specialized abilities, and the role of genetic factors in perception and cognition. The study of this ability should yield considerable dividends in the years to come.


Acknowledgments
I am grateful to Trevor Henthorn for help with the illustrations, and to Frank Coffaro for help with formatting the references. Preparation of this chapter was supported in part by an Interdisciplinary Research Award to the author from the University of California, San Diego.

References
Athos, E. A., Levinson, B., Kistler, A., Zemansky, J., Bostrom, A., & Freimer, N., et al. (2007). Dichotomy and perceptual distortions in absolute pitch ability. Proceedings of the National Academy of Sciences, USA, 104, 1479514800. Bachem, A. (1940). The genesis of absolute pitch. Journal of the Acoustical Society of America, 11, 434439. Bachem, A. (1948). Chroma fixation at the ends of the musical frequency scale. Journal of the Acoustical Society of America, 20, 704705. Bachem, A. (1954). Time factors in relative and absolute pitch determination. Journal of the Acoustical Society of America, 26, 751753. Bachem, A. (1955). Absolute pitch. Journal of the Acoustical Society of America, 27, 11801185. Baggaley, J. (1974). Measurement of absolute pitch: a confused field. Psychology of Music, 2, 1117. Baharloo, S., Johnston, P. A., Service, S. K., Gitschier, J., & Freimer, N. B. (1998). Absolute pitch: an approach for identification of genetic and nongenetic components. American Journal of Human Genetics, 62, 224231. Baharloo, S., Service, S. K., Risch, N., Gitschier, J., & Freimer, N. B. (2000). Familial aggregation of absolute pitch. American Journal of Human Genetics, 67, 755758. Baird, J. W. (1917). Memory for absolute pitch. In E. C. Sanford (Ed.), Studies in psychology, Titchener commemorative volume (pp. 4378). Worcester, MA: Wilson. Balzano, G. J. (1984). Absolute pitch and pure tone identification. Journal of the Acoustical Society of America, 75, 623625. Barlow, H., & Morgenstern, S. (2008). The electronic dictionary of musical themes. The Multimedia Library. Bates, E. (1992). Language development. Current Opinion in Neurobiology, 2, 180185. Benguerel, A., & Westdal, C. (1991). Absolute pitch and the perception of sequential musical intervals. Music Perception, 9, 105119. Bergeson, T. R., & Trehub, S. E. (2002). Absolute pitch and tempo in mothers songs to infants. Psychological Science, 13, 7275. Bermudez, P., & Zatorre, R. J. (2005). Conditional associative memory for musical stimuli in nonmusicians: implications for absolute pitch. Journal of Neuroscience, 25, 77187723. Bermudez, P., & Zatorre, R. J. (2009a). A distribution of absolute pitch ability as revealed by computerized testing. Music Perception, 27, 89101. Bermudez, P., & Zatorre, R. J. (2009b). The absolute pitch mind continues to reveal itself. Journal of Biology, 8, 75. doi:10.1186/jbiol171 Besner, D., & McCann, R. S. (1987). Word frequency and pattern distortion in visual word identification and production: an examination of four classes of models. In M.


Coltheart (Ed.), Attention and performance XII: The psychology of reading (pp. 201219). Hillsdale, NJ: Erlbaum. Bonnel, A., Mottron, L., Peretz, I., Trudel, M., Gallun, E., & Bonnel, A.-M. (2003). Enhanced pitch sensitivity in individuals with autism: a signal detection analysis. Journal of Cognitive Neuroscience, 15, 226235. Brady, P. T. (1970). Fixed scale mechanism of absolute pitch. Journal of the Acoustical Society of America, 48, 883887. Braun, M., & Chaloupka, V. (2005). Carbamazepine induced pitch shift and octave space representation. Hearing Research, 210, 8592. Burnham, D., & Brooker, R. (2002). Absolute pitch and lexical tones: Tone perception by non-musician, musician, and absolute pitch non-tonal language speakers. In J. Hansen, & B. Pellom (Eds.), The 7th International Conference on Spoken Language Processing (pp. 257260). Denver. Burns, E. M. (1999). Intervals, scales, and tuning. In D. Deutsch (Ed.), The psychology of music (2nd ed., pp. 215264). San Diego, CA: Academic Press. Burns, E. M., & Campbell, S. L. (1994). Frequency and frequency-ratio resolution by possessors of absolute and relative pitch: examples of categorical perception? Journal of the Acoustical Society of America, 96, 27042719. Carpenter, A. (1951). A case of absolute pitch. Quarterly Journal of Experimental Psychology, 3, 9293. Carroll, J. B. (1975). Speed and accuracy of absolute pitch judgments: some latter-day results. Educational Testing Service research bulletin. Princeton, NJ: Educational Testing Service (RB-75-35). Chalikia, M. H., & Leinfelt, F. (2000). Listeners in Sweden perceive tritone stimuli in a manner different from that of Americans and similar to that of British listeners. Journal of the Acoustical Society of America, 108, 2572. Chalikia, M. H., Norberg, A. M., & Paterakis, L. (2000). Greek bilingual listeners perceive the tritone stimuli differently from speakers of English. Journal of the Acoustical Society of America, 108, 2572. Chalikia, M. H., & Vaid, J. (1999). Perception of the tritone paradox by listeners in Texas: a re-examination of envelope effects. Journal of the Acoustical Society of America, 106, 2572. Chaloupka, V., Mitchell, S., & Muirhead, R. (1994). Observation of a reversible, medication-induced change in pitch perception. Journal of the Acoustical Society of America, 96, 145149. Corliss, E. L. (1973). Remark on fixed-scale mechanism of absolute pitch. Journal of the Acoustical Society of America, 53, 17371739. Cuddy, L. L. (1968). Practice effects in the absolute judgment of pitch. Journal of the Acoustical Society of America, 43, 10691076. Curtiss, S. (1977). Genie: A psycholinguistic study of a modern day wild child. New York, NY: Academic Press. Dawe, L. A., Platt, J. R., & Welsh, E. (1998). Spectral motion after-effects and the tritone paradox among Canadian subjects. Perception & Psychophysics, 60, 209220. Dennis, M., & Whitaker, H. A. (1976). Language acquisition following hemidecortication: linguistic superiority of the left over the right hemisphere. Brain and Language, 3, 404433. Deutsch, D. (1970). Tones and numbers: specificity of interference in short-term memory. Science, 168, 16041605.


Deutsch, D. (1975). The organization of short-term memory for a single acoustic attribute. In D. Deutsch, & J. A. Deutsch (Eds.), Short-term memory (pp. l07l51). New York, NY: Academic Press. Deutsch, D. (1986). A musical paradox. Music Perception, 3, 275280. Deutsch, D. (1987). The tritone paradox: effects of spectral variables. Perception & Psychophysics, 42, 563575. Deutsch, D. (1988). The semitone paradox. Music Perception, 6, 115132. Deutsch, D. (1991). The tritone paradox: an influence of language on music perception. Music Perception, 8, 335347. Deutsch, D. (1992). Some new pitch paradoxes and their implications. Auditory Processing of Complex Sounds. Philosophical Transactions of the Royal Society, Series B, 336, 391397. Deutsch, D. (2002). The puzzle of absolute pitch. Current Directions in Psychological Science, 11, 200204. Deutsch, D. (1994). The tritone paradox: some further geographical correlates. Music Perception, 12, 125136. Deutsch, D., Dooley, K., & Henthorn, T. (2008). Pitch circularity from tones comprising full harmonic series. Journal of the Acoustical Society of America, 124, 589597. Deutsch, D., Dooley, K., Henthorn, T., & Head, B. (2009). Absolute pitch among students in an American music conservatory: association with tone language fluency. Journal of the Acoustical Society of America, 125, 23982403. Deutsch, D., Henthorn, T., & Dolson, M. (1999). Absolute pitch is demonstrated in speakers of tone languages. Journal of Acoustical Society of America, 106, 2267. Deutsch, D., Henthorn, T., & Dolson, M. (2004a). Absolute pitch, speech, and tone language: some experiments and a proposed framework. Music Perception, 21, 339356. Deutsch, D., Henthorn, T., & Dolson, M. (2004b). Speech patterns heard early in life influence later perception of the tritone paradox. Music Perception, 21, 357372. Deutsch, D., Henthorn, E., Marvin, W., & Xu, H.-S. (2006). Absolute pitch among American and Chinese conservatory students: prevalence differences, and evidence for speechrelated critical period. Journal of the Acoustical Society of America, 119, 719722. Deutsch, D., Kuyper, W. L., & Fisher, Y. (1987). The tritone paradox: its presence and form of distribution in a general population. Music Perception, 5, 7992. Deutsch, D., Le, J., Shen, J., & Henthorn, T. (2009). The pitch levels of female speech in two Chinese villages. Journal of the Acoustical Society of America Express Letters, 125, 208213. Deutsch, D., Le, J., Shen, J., & Li, X. (2011). Large-scale direct-test study reveals unexpected characteristics of absolute pitch. Journal of the Acoustical Society of America, 130, 2398. Deutsch, D., Moore, F. R., & Dolson, M. (1986). The perceived height of octave-related complexes. Journal of the Acoustical Society of America, 80, 13461353. Deutsch, D., North, T., & Ray, L. (1990). The tritone paradox: correlate with the listeners vocal range for speech. Music Perception, 7, 371384. Dolson, M. (1994). The pitch of speech as a function of linguistic community. Music Perception, 11, 321331. Dooley, K., & Deutsch, D. (2010). Absolute pitch correlates with high performance on musical dictation. Journal of the Acoustical Society of America, 128, 890893. Dooley, K., & Deutsch, D. (2011). Absolute pitch correlates with high performance on interval naming tasks. Journal of the Acoustical Society of America, 130, 40974104.


Doupe, A. J., & Kuhl, P. K. (1999). Birdsong and human speech: common themes and mechanisms. Annual Review of Neuroscience, 22, 567631. Drayna, D. T. (2007). Absolute pitch: A special group of ears. Proceedings of the National Academy of Sciences, U.S.A, 104, 1454914550. Duchowny, M., Jayakar, P., Harvey, A. S., Resnick, T., Alvarez, L., & Dean, P., et al. (1996). Language cortex representation: effects of developmental versus acquired pathology. Annals of Neurology, 40, 3138. Edmondson, J. A., Chan, J.-L., Seibert, G. B., & Ross, E. D. (1987). The effect of right brain damage on acoustical measures of affective prosody in Taiwanese patients. Journal of Phonetics, 15, 219233. Fujimoto, A., Enomoto, T., Takano, S., & Nose, T. (2004). Pitch perception abnormality as a side effect of carbamazepine. Journal of Clinical Neuroscience, 11, 6970. Fujisaki, W., & Kashino, M. (2002). The basic hearing abilities of absolute pitch possessors. Acoustical Science and Technology, 23, 7783. Gaab, N., Schulze, K., Ozdemir, E., & Schlaug, G. (2006). Neural correlates of absolute pitch differ between blind and sighted musicians. NeuroReport, 17, 18531857. Gandour, J., & Dardarananda, R. (1983). Identification of tonal contrasts in Thai aphasic patients. Brain and Language, 18, 98114. Gandour, J., Ponglorpisit, S., Khunadorn, F., Dechongkit, S., Boongird, P., & Boonklam, R., et al. (1992). Lexical tones in Thai after unilateral brain damage. Brain and Language, 43, 275307. Gandour, J., Wong, D., & Hutchins, G. (1998). Pitch processing in the human brain is influenced by language experience. Neuroreport, 9, 21152119. Geschwind, N., & Fusillo, M. (1966). Color-naming defects in association with alexia. Archives of Neurology, 15, 137146. Geschwind, N., & Levitsky, W. (1968). Human brain: leftright asymmetries in temporal speech region. Science, 161, 186187. Giangrande, J. (1998). The tritone paradox: effects of pitch class and position of the spectral envelope. Music Perception, 15, 253264. Gorelick, P. B., & Ross, E. D. (1987). The aprosodias: further functional-anatomic evidence for organization of affective language in the right hemisphere. Journal of Neurology, Neurosurgery, and Psychiatry, 50, 553560. Gough, E. (1922). The effects of practice on judgments of absolute pitch. Archives of Psychology, 7, 193. Gougoux, F., Lepore, F., Lassonde, M., Voss, P., Zatorre, R. J., & Belin, P. (2004). Pitch discrimination in the early blind. Nature, 430, 309. Gregersen, P. K., Kowalsky, E., Kohn, N., & Marvin, E. W. (1999). Absolute pitch: prevalence, ethnic variation, and estimation of the genetic component. American Journal of Human Genetics, 65, 911913. Gregersen, P. K., Kowalsky, E., Kohn, N., & Marvin, E. W. (2001). Early childhood music education and predisposition to absolute pitch: teasing apart genes and environment. American Journal of Medical Genetics, 98, 280282. Gussmack, M. B., Vitouch, O., & Gula, B. (2006). Latent absolute pitch: An ordinary ability? In M. Baroni, A. R. Addessi, R. Caterina, & M. Costa (Eds.), Proceedings of the 9th International Conference on Music Perception and Cognition (pp. 14081412). Bologna, Italy: Bononia University Press. Halpern, A. R. (1989). Memory for the absolute pitch of familiar songs. Memory and Cognition, 17, 572581.


Hamilton, R. H., Pascual-Leone, A., & Schlaug, G. (2004). Absolute pitch in blind musicians. NeuroReport, 15, 803806. Hantz, E. C., Kreilick, K. G., Braveman, A. L., & Swartz, K. P. (1995). Effects of musical training and absolute pitch on a pitch memory task an event-related-potential study. Psychomusicology, 14, 5376. Heaton, P. (2003). Pitch memory, labelling and disembedding in autism. Journal of Child Psychology and Psychiatry, 44, 19. Heaton, P. (2005). Interval and contour processing in autism. Journal of Autism and Developmental Disorders, 8, 17. Heaton, P. (2009). Assessing musical skills in autistic children who are not savants. Philosophical Transactions of the Royal Society B, 364, 14431447. Heaton, P., Hermelin, B., & Pring, L. (1998). Autism and pitch processing: a precursor for savant musical ability? Music Perception, 15, 291305. Heaton, P., Williams, K., Cummins, O., & Happe, F. (2008). Autism and pitch processing splinter skills. Autism, 12, 203219. Heller, M. A., & Auerbach, C. (1972). Practice effects in the absolute judgment of frequency. Psychonomic Science, 26, 222224. Henthorn, T., & Deutsch, D. (2007). Ethnicity versus early environment: Comment on Early Childhood Music Education and Predisposition to Absolute Pitch: Teasing Apart Genes and Environment by Peter K. Gregersen, Elena Kowalsky, Nina Kohn, and Elizabeth West Marvin [2000]. American Journal of Medical Genetics, 143A, 102103. Hess, E. H. (1973). Imprinting: Early experience and the developmental psychobiology of attachment. New York, NY: Van Nordstrand Reinhold. Hickok, G., & Poeppel, D. (2007). The cortical organization of speech processing. Nature Reviews Neuroscience, 8, 393402. Hirose, H., Kubota, M., Kimura, I., Ohsawa, M., Yumoto, M., & Sakakihara, Y. (2002). People with absolute pitch process tones with producing P300. Neuroscience Letters, 330, 247250. Hsieh, I.-H., & Saberi, K. (2007). Temporal integration in absolute identification of musical pitch. Hearing Research, 233, 108116. Hsieh, I.-H., & Saberi, K. (2008). Language-selective interference with long-term memory for musical pitch. Acta Acustica united with Acustica, 94, 588593. Hubel, D. H., & Wiesel, T. N. (1970). The period of susceptibility to the physiological effects of unilateral eye closure in kittens. Journal of Physiology, 206, 419436. Hughes, C. P., Chan, J. L., & Su, M. S. (1983). Aprosodia in Chinese patients with right cerebral hemisphere lesions. Archives of Neurology, 40, 732736. Huron, D. (2006). Sweet anticipation. Cambridge, MA: MIT Press. Itoh, K., Suwazono, S., Arao, H., Miyazaki, K., & Nakada, T. (2005). Electrophysiological correlates of absolute pitch and relative pitch. Cerebral Cortex, 15, 760769. rvinen-Pasley, A., Wallace, G. L., Ramus, F., Happe, F., & Heaton, P. (2008). Enhanced Ja perceptual processing of speech in autism. Developmental Science, 11, 109121. Johnson, J. S., & Newport, E. L. (1989). Critical periods in second language learning: the influence of maturational state on the acquisition of English as a second language. Cognitive Psychology, 21, 6099. Jun, J., Kim, J., Lee, H., & Jun, S. -A. (2006). The prosodic structure and pitch accent of Northern Kyungsang Korean. Journal of East Asian Linguistics, 15, 289317. Kanner, L. (1943). Autistic disturbances of affective contact. The Nervous Child, 2, 217250.


Keenan, J. P., Thangaraj, V., Halpern, A. R., & Schlaug, G. (2001). Absolute pitch and planum temporale. NeuroImage, 14, 14021408. Klein, M., Coles, M. G. H., & Donchin, E. (1984). People with absolute pitch process tones without producing a P300. Science, 223, 13061309. Knudsen, E. I. (1988). Sensitive and critical periods in the development of sound localization. In S. S. Easter, K. F. Barald, & B. M. Carlson (Eds.), From message to mind: Directions in developmental neurobiology. Sunderland, MA: Sinauer Associates. Konno, S., Yamazaki, E., Kudoh, M., Abe, T., & Tohgi, H. (2003). Half pitch lower sound perception caused by carbamazepine. Internal Medicine, 42, 880883. Kuhl, P. K. (1991). Human adults and human infants show a perceptual magnet effect for the prototypes of speech categories, monkeys do not. Perception & Psychophysics, 50, 93107. Kuhl, P., Williams, K., Lacerda, F., Stevens, K., & Lindblom, B. (1992). Linguistic experience alters phonetic perception in infants by 6 months of age. Science, 255, 606608. Lane, H. L. (1976). The wild boy of Aveyron. Cambridge, MA: Harvard University Press. Lee, C.-Y., & Lee, Y.-F. (2010). Perception of musical pitch and lexical tones by Mandarinspeaking musicians. Journal of the Acoustical Society of America, 127, 481490. Lenhoff, H. M., Perales, O., & Hickok, G. (2001). Absolute pitch in Williams syndrome. Music Perception, 18, 491503. Lennenberg, E. H. (1967). Biological foundations of language. New York, NY: Wiley. Levitin, D. J. (1994). Absolute memory for musical pitch: evidence for the production of learned melodies. Perception & Psychophysics, 56, 414423. Levitin, D. J., & Rogers, S. E. (2005). Absolute pitch: Perception, coding, and controversies. Trends in Cognitive Science, 9, 2633. Liebenthal, E., Binder, J. R., Spitzer, S. M., Possing, E. T., & Medler, D. A. (2005). Neural substrates of phonemic perception. Cerebral Cortex, 15, 16211631. Lockhead, G. R., & Byrd, R. (1981). Practically perfect pitch. Journal of the Acoustical Society of America, 70, 387389. Loui, P., Li, H., Hohmann, A., & Schlaug, G. (2011). Enhanced cortical connectivity in absolute pitch musicians: a model for local hyperconnectivity. Journal of Cognitive Neuroscience, 23, 10151026. Macmillan, N. A., Goldberg, R. F., & Braida, L. D. (1988). Resolution for speech sounds: basic sensitivity and context memory on vowel and consonant continua. Journal of the Acoustical Society of America, 84, 12621280. Marvin, E. W., & Brinkman, A. R. (2000). The effect of key color and timbre on absolute pitch recognition in musical contexts. Music Perception, 18, 111137. Masataka, N. (2011). Enhancement of speech-relevant auditory acuity in absolute pitch possessors. Frontiers in Psychology, 2, 14. Meyer, M. (1899). Is the memory of absolute pitch capable of development by training? Psychological Review, 6, 514516. Miller, L. (1989). Musical savants: Exceptional skills in the mentally retarded. Hillsdale, NJ: Erlbaum. Miyazaki, K. (1988). Musical pitch identification by absolute pitch possessors. Perception & Psychophysics, 44, 501512. Miyazaki, K. (1989). Absolute pitch identification: effects of timbre and pitch region. Music Perception, 7, 114. Miyazaki, K. (1990). The speed of musical pitch identification by absolute pitch possessors. Music Perception, 8, 177188.


Miyazaki, K. (1992). Perception of musical intervals by absolute pitch possessors. Music Perception, 9, 413426. Miyazaki, K. (1993). Absolute pitch as an inability: identification of musical intervals in a tonal context. Music Perception, 11, 5572. Miyazaki, K. (1995). Perception of relative pitch with different references: some absolutepitch listeners cant tell musical interval names. Perception & Psychophysics, 57, 962970. Miyazaki, K. (2004). How well do we understand pitch? Acoustical Science and Technology, 25, 426432. Miyazaki, K., & Ogawa, Y. (2006). Learning absolute pitch by children: a cross-sectional study. Music Perception, 24, 6378. Miyazaki, K., & Rakowski, A. (2002). Recognition of notated melodies by possessors and nonpossessors of absolute pitch. Perception & Psychophysics, 64, 13371345. Moen, I., & Sundet, K. (1996). Production and perception of word tones (pitch accents) in patients with left and right hemisphere damage. Brain and Language, 53, 267281. tto nen, R., Calvert, G. A., Ja a skela inen, I. P., Matthews, P. M., Thesen, T., & Mo Tuomainen, J., et al. (2006). Perceiving identical sounds as speech or non-speech modulates activity in the left posterior superior temporal sulcus. Neuroimage, 30, 563569. Mottron, L., Peretz, I., Belleville, S., & Rouleau, N. (1999). Absolute pitch in autism: a case study. Neurocase, 5, 485501. Mull, H. K. (1925). The acquisition of absolute pitch. American Journal of Psychology, 36, 469493. Naeser, M. A., & Chan, S. W.-C. (1980). Case study of a Chinese aphasic with the Boston diagnostic aphasia exam. Neuropsychologia, 18, 389410. Newport, E. L. (1990). Maturational constraints on language learning. Cognitive Science, 14, 1128. Newport, E. L., Bavelier, D., & Neville, H. J. (2001). Critical thinking about critical periods. In E. Dupoux (Ed.), Language, brain, and cognitive development: Essays in honor of Jacques Mehler. Cambridge, MA: MIT Press. ncke, L. (2010). Absolute pitch: functional evidence of Oechslin, M. S., Meyer, M., & Ja speech-relevant auditory acuity. Cerebral Cortex, 20, 447455. Ohnishi, T., Matsuda, H., Asada, T., Atuga, M., Hirakata, M., & Nishikawa, M., et al. (2001). Functional anatomy of musical perception in musicians. Cerebral Cortex, 11, 754760. Packard, J. L. (1986). Tone production deficits in nonfluent aphasic Chinese speech. Brain and Language, 29, 212223. Patkowski, M. S. (1990). Age and accent in a second language: a reply to James Emil Flege. Applied Linguistics, 11, 7389. Patterson, R. D. (1990). The tone height of multiharmonic sounds. Music Perception, 8, 203214. Patterson, R. D., Milroy, R., & Allerhand, M. (1993). What is the octave of a harmonically rich note? Contemporary Music Review, 9, 6981. Peng, G., Deutsch, D., Henthorn, T., Su, D.-J., & Wang, W. S.-Y. (in press). Language experience influences nonlinguistic pitch perception. Journal of Chinese Linguistics. Petrides, M. (1985). Deficits in non-spatial conditional associative learning after periarcuate lesions in the monkey. Behavioral Brain Research, 16, 95101. Petrides, M. (1990). Nonspatial conditional learning impaired in patients with unilateral frontal but not unilateral temporal lobe excisions. Neuropsychologia, 28, 137149.


Pressnitzer, D., Patterson, R. D., & Krumbholz, K. (2001). The lower limit of melodic pitch. Journal of the Acoustical Society of America, 109, 20742084. Profita, J., & Bidder, T. G. (1988). Perfect pitch. American Journal of Medical Genetics, 29, 763771. Ragozzine, R., & Deutsch, D. (1994). A regional difference in perception of the tritone paradox within the United States. Music Perception, 12, 213225. Rakowski, A. (1978). Investigations of absolute pitch. In E. P. Asmus, Jr. (Ed.), Proceedings of the Research Symposium on the Psychology and Acoustics of Music (pp. 4557). Lawrence: University of Kansas. Rakowski, A. (1993). Categorical perception in absolute pitch. Archives of Acoustics, 18, 515523. Rakowski, A., & Miyazaki, K. (2007). Absolute pitch: common traits in music and language. Archives of Acoustics, 32, 516. Rakowski, A., & Morawska-Bungeler, M. (1987). In search of the criteria for absolute pitch. Archives of Acoustics, 12, 7587. Rakowski, A., & Rogowski, P. (2007). Experiments on long-term and short-term memory for pitch in musicians. Archives of Acoustics, 32, 815826. Repp, B. H., & Thompson, J. M. (2010). Context sensitivity and invariance in perception of octave-ambiguous tones. Psychological Research, 74, 437456. Revesz, G. (1953). Introduction to the psychology of music. London, England: Longmans Green. Rimland, B., & Hill, A. (1984). Idiot savants. In J. Wortes (Ed.), Mental retardation and developmental disabilities (pp. 155169). New York, NY: Plenum Press. Roder, B., Teder-Salejarvi, W., Sterr, A., Rosler, F., Hillyard, S. A., & Neville, H. J. (1999). Improved auditory spatial tuning in blind humans. Nature, 400, 162165. Rojas, D. C., Bawn, S. D., Benkers, T. L., Reite, M. L., & Rogers, S. J. (2002). Smaller left misphe ` re planum temporale in adults with autistic disorder. Neuroscience Letters, he 328, 237240. Rojas, D. C., Camou, S. L., Reite, M. L., & Rogers, S. J. (2005). Planum temporale volume in children and adolescents with autism. Journal of Autism and Developmental Disorders, 35, 479486. Ross, D. A., & Marks, L. E. (2009). Absolute pitch in children prior to the beginning of musical training. Annals of the New York Academy of Sciences, 1169, 199204. Ross, D. A., Olson, I. R., & Gore, J. C. (2003). Cortical plasticity in an early blind musician: an fMRl study. Magnetic Resonance Imaging, 21, 821828. Ross, E. D. (1981). The aprosodias: functionalanatomic organization of the affective components of language in the right hemisphere. Archives of Neurology, 38, 561569. Russo, F. A., Windell, D. L., & Cuddy, L. L. (2003). Learning the special note: evidence for a critical period for absolute pitch acquisition. Music Perception, 21, 119127. Saffran, J. R., & Griepentrog, G. J. (2001). Absolute pitch in infant auditory learning: evidence for developmental reorganization. Developmental Psychology, 37, 7485. Sakai, K. L. (2005). Language acquisition and brain development. Science, 310, 815819. Schellenberg, E. G., & Trehub, S. E. (2003). Good pitch memory is widespread. Psychological Science, 14, 262266. ncke, L., Huang, Y., & Steinmetz, H. (1995). In vivo evidence of structural Schlaug, G., Ja brain asymmetry in musicians. Science, 267, 699701. Schulze, K., Gaab, N., & Schlaug, G. (2009). Perceiving pitch absolutely: comparing absolute and relative pitch possessors in a pitch memory task. BMC Neuroscience, 10, 14712202.


Scovel, T. (1969). Foreign accent, language acquisition, and cerebral dominance. Language Learning, 19, 245253. Seashore, C. E. (1940). Acquired pitch vs. absolute pitch. Music Education Journal, 26, 18. Semal, C., & Demany, L. (1990). The upper limit of "musical" pitch. Music Perception, 8, 165176. Sergeant, D. (1969). Experimental investigation of absolute pitch. Journal of Research in Musical Education, 17, 135143. Siegel, J. A. (1972). The nature of absolute pitch. In E. Gordon (Ed.), Experimental research in the psychology of music: VIII. Studies in the psychology of music (pp. 6589). Iowa City: Iowa University Press. Siegel, J. A. (1974). Sensory and verbal coding strategies in subjects with absolute pitch. Journal of Experimental Psychology, 103, 3744. Siegel, J. A., & Siegel, W. (1977). Absolute identification of notes and intervals by musicians. Perception & Psychophysics, 21, 143152. Simpson, J., & Huron, D. (1994). Absolute pitch as a learned phenomenon: evidence consistent with the HickHyman Law. Music Perception, 12, 267270. Smith, N. A., & Schmuckler, M. A. (2008). Dial A440 for absolute pitch: absolute pitch memory by non-absolute pitch possessors. Journal of the Acoustical Society of America, 123, EL77EL84. Spender, N. (1980). Absolute pitch. In S. Sadie (Ed.), The new Grove dictionary of music and musicians (pp. 2729). London, England: Macmillan. Stroop, J. R. (1935). Studies of interference in serial verbal reactions. Journal of Experimental Psychology, 18, 643662. Takeuchi, A. H. (1989). Absolute pitch and response time: The processes of absolute pitch identification (Unpublished master s thesis). Johns Hopkins University, Baltimore, MD. Takeuchi, A. H., & Hulse, S. H. (1991). Absolute-pitch judgments of black and white-key pitches. Music Perception, 9, 2746. Takeuchi, A. H., & Hulse, S. H. (1993). Absolute pitch. Psychological Bulletin, 113, 345361. Tateno, A., Sawada, K., Takahashi, I., & Hujiwara, Y. (2006). Carbamazepine-induced transient auditory pitch-perception deficit. Pediatric Neurology, 35, 131134. Terhardt, E., & Seewann, M. (1983). Aural key identification and its relationship to absolute pitch. Music Perception, 1, 6383. Terhardt, E., & Ward, W. D. (1982). Recognition of musical key: exploratory study. Journal of the Acoustical Society of America, 72, 2633. Theusch, E., Basu, A., & Gitschier, J. (2009). Genome-wide study of families with absolute pitch reveals linkage to 8q24.21 and locus heterogeneity. American Journal of Human Genetics, 85, 112119. Trehub, S. E., Schellenberg, E. G., & Nakata, T. (2008). Cross-cultural perspectives on pitch memory. Journal of Experimental Child Psychology, 100, 4052. Trout, J. D. (2003). Biological specializations for speech: what can the animals tell us? Current Directions in Psychological Science, 12, 155159. Tucker, D. M., Watson, R. T., & Heilman, K. M. (1977). Discrimination and evocation of affectively intoned speech in patients with right parietal disease. Neurology, 27, 947950. van Krevelen, A. (1951). The ability to make absolute judgements of pitch. Journal of Experimental Psychology, 42, 207215. Van Lancker, D., & Fromkin, V. (1973). Hemispheric specialization for pitch and tone: Evidence from Thai. Journal of Phonetics, 1, 101109.


Varyha-Khadem, F., Carr, L. J., Isaacs, E., Brett, E., Adams, C., & Mishkin, M. (1997). Onset of speech after left hemispherectomy in a nine year old boy. Brain, 120, 159182. Vernon, E. (1977). Absolute pitch: A case study. British Journal of Psychology, 83, 485489. Vitouch, O. (2003). Absolutist models of absolute pitch are absolutely misleading. Music Perception, 21, 111117. Vitouch, O., & Gaugusch, A. (2000). Absolute recognition of musical keys in nonabsolute-pitch-possessors. In C. Woods, G. Luck, R. Brochard, F. Seddon, & J. A. Sloboda (Eds.), Proceedings of the 6th International Conference on Music Perception and Cognition [CD-ROM]. Keele, UK: Dept. of Psychology, Keele University. Wada, J. A., Clarke, R., & Harem, A. (1975). Cerebral hemispheric asymmetry in humans: cortical speech zones in 100 adult and100 infant brains. Archives of Neurology, 32, 239246. Ward, W. D. (1999). Absolute pitch. In D. Deutsch (Ed.), The psychology of music (pp. 265298). San Diego, CA: Academic Press. Ward, W. D., & Burns, E. M. (1982). Absolute pitch. In D. Deutsch (Ed.), The psychology of music (pp. 431451). San Diego, CA: Academic Press. Wayland, R. P., & Guion, S. G. (2004). Training English and Chinese listeners to perceive Thai tones: a preliminary report. Language Learning, 54, 681712. Wayman, J. W., Frisina, R. D., Walton, J. P., Hantz, E. C., & Crummer, G. C. (1992). Effects of musical training and absolute pitch ability on event-related activity in response to sine tones. Journal of the Acoustical Society of America, 91, 35273531. Wedell, C. H. (1934). The nature of the absolute judgment of pitch. Journal of Experimental Psychology, 17, 485503. Welch, G. F. (1988). Observations on the incidence of absolute pitch (AP) ability in the early blind. Psychology of Music, 16, 7780. Werker, J., & Lalonde, C. (1988). Cross-language speech perception: initial capabilities and developmental change. Developmental Psychology, 24, 672683. Wilson, S. J., Lusher, D., Wan, C. Y., Dudgeon, P., & Reutens, D. C. (2009). The neurocognitive components of pitch processing: insights from absolute pitch. Cerebral Cortex, 19, 724732. Woods, B. T. (1983). Is the left hemisphere specialized for language at birth? Trends in Neuroscience, 6, 115117. Yabe, T., & Kaga, K. (2005). Sound lateralization test in adolescent blind individuals. Neuroreport, 16, 939942. Yoshikawa, H., & Abe, T. (2003). Carbamazepine-induced abnormal pitch perception. Brain Development, 25, 127129. Young, R., & Nettlebeck, T. (1995). The abilities of a musical savant and his family. Journal of Autism and Developmental Disorders, 25, 229245. r Zakay, D., Roziner, I., & Ben-Arzi, S. (1984). On the nature of absolute pitch. Archive fu Psychologie, 136, 163166. Zatorre, R. J. (2003). Absolute pitch: a model for understanding the influence of genes and development on cognitive function. Nature Neuroscience, 6, 692695. Zatorre, R. J., Perry, D. W., Beckett, C. A., Westbury, C. F., & Evans, A. C. (1998). Functional anatomy of musical processing in listeners with absolute pitch and relative pitch. Proceedings of the National Academy of Sciences, 95, 31723177.

6 Grouping Mechanisms in Music


Diana Deutsch
Department of Psychology, University of California, San Diego, La Jolla, California

I. Introduction

Music provides us with a complex, rapidly changing acoustic spectrum, often derived from the superposition of sounds from many different sources. Our auditory system has the task of analyzing this spectrum so as to reconstruct the originating sound events, a task often referred to as auditory scene analysis (Bregman, 1990). This is analogous to the task performed by our visual system when it interprets the mosaic of light impinging on the retina in terms of visually perceived objects. Such a view of perception as a process of unconscious inference was proposed in the last century by Helmholtz (1909-1911/1925), and we shall see that many phenomena of music perception can be viewed in this light.

Several issues are considered here. First, given that our auditory system is presented with a set of low-level elements, we can explore the ways in which these are combined so as to form separate groupings. If all low-level elements were indiscriminately linked together, auditory shape recognition operations could not be performed. There must, therefore, be a set of mechanisms that enable us to form linkages between some low-level elements and inhibit us from forming linkages between others.

In examining such linkages, we can follow two lines of inquiry. The first concerns the dimensions along which grouping principles operate. When presented with a complex pattern, the auditory system groups elements according to some rule based on frequency, amplitude, timing, spatial location, or some multidimensional attribute such as timbre. As we shall see, any of these attributes can be used as a basis for grouping; however, the conditions that determine which attribute is followed are complex ones. Second, assuming that organization takes place on the basis of some dimension such as pitch, we can inquire into the principles that govern grouping along this dimension. The early Gestalt psychologists proposed that we group elements into configurations on the basis of various simple rules (Wertheimer, 1923). One is proximity: closer elements are grouped together in preference to those that are further apart. An example is shown in Figure 1a, where the closer dots are perceptually grouped together in pairs. Another is similarity: in viewing Figure 1b,


Figure 1 Illustrations of the Gestalt principles of proximity (a), similarity (b), and good continuation (c).

Another is similarity: in viewing Figure 1b, we perceive one set of vertical columns formed by the filled circles and another formed by the unfilled circles. A third, good continuation, states that elements that follow each other in a given direction are perceptually linked together: We group the dots in Figure 1c so as to form the two lines AB and CD. A fourth, common fate, states that elements that change in the same way are perceptually linked together. As a fifth principle, we tend to form groupings so as to perceive configurations that are familiar to us.

It is reasonable to assume that grouping in conformity with such principles enables us to interpret our environment most effectively. In the case of vision, elements that are close in space are more likely to belong to the same object than are elements that are spaced further apart. The same line of reasoning holds for elements that are similar rather than those that are dissimilar. In the case of hearing, similar sounds are likely to have originated from a common source, and dissimilar sounds from different sources. A sequence that changes smoothly in frequency is likely to have originated from a single source, whereas an abrupt frequency transition may reflect the presence of a new source. Components of a complex spectrum that arise in synchrony are likely to have emanated from the same source, and the sudden addition of a new component may signal the emergence of a new source.

As a related question, we can ask whether the perceptual grouping of elements along dimensions such as frequency and spatial location results from the action of a single, overarching decision mechanism or from multiple decision mechanisms, each with its own grouping criteria. As will be described, the evidence shows that grouping decisions are not made by a single, internally coherent, system, but rather by a number of different subsystems, which at some stage act independently of each other, and can arrive at inconsistent conclusions. For example, the sound elements that are assigned to different sources so as to determine perceived pitch can differ from those that are assigned to determine perceived timbre, loudness, and location. From such findings, we must conclude that perceptual organization in music involves a process in which elements are first grouped together in various ways so as to assign values to different attributes separately, and that this is followed by a process of perceptual synthesis in which the different attribute values are combined. As a result of this two-stage process, the different attribute values are sometimes combined incorrectly, so that illusory conjunctions arise (cf. Deutsch, 1975a, 1975b, 1981, 2004; Deutsch, Hamaoui, & Henthorn, 2007; Deutsch & Roll, 1976). Auditory scene analysis cannot, therefore, be regarded as the product of a single, internally coherent system, but rather as the product of multiple systems whose
outputs are sometimes inconsistent with each other (see also Hukin & Darwin, 1995a; Darwin & Carlyon, 1995). As a further issue, the grouping of sound elements in music involves not only the creation of low-level features such as tones, but also the conjunction of these features at higher levels so as to form intervals, chords, durational relationships, and rhythmic patterns, as well as phrases and phrase groups (see also Chapter 7). As we shall find, auditory grouping is the function of a highly elaborate and multifaceted system, whose complexities are becoming increasingly apparent.

II. Fusion and Separation of Spectral Components

In this section, we consider the relationships between the components of a musical sound spectrum that lead us to fuse them into a unitary sound image and those that lead us to separate them into multiple sound images. In particular, we explore two types of relationship. The first is harmonicity. Natural sustained sounds, such as are produced by musical instruments and the human voice, are made up of components that stand in harmonic, or near-harmonic, relation; that is, their frequencies are integer, or near-integer multiples of the fundamental frequency. It is reasonable to expect, therefore, that the auditory system would exploit this feature so as to combine a set of harmonically related components into a single sound image. A second relationship that we explore is onset synchronicity. When components of a sound complex begin at the same time, they are likely to have originated from the same source; conversely, when they begin at different times, they are likely to have originated from different sources. As an associated issue, we explore temporal correspondences in the fluctuations of components in the steady-state portion of a sound. The importance of temporal relationships for perceptual fusion and separation was recognized by Helmholtz in his treatise On the Sensations of Tone (1859/1954), in which he wrote:
Now there are many circumstances which assist us first in separating the musical tones arising from different sources, and secondly, in keeping together the partial tones of each separate source. Thus when one musical tone is heard for some time before being joined by the second, and then the second continues after the first has ceased, the separation in sound is facilitated by the succession in time. We have already heard the first musical tone by itself and hence know immediately what we have to deduct from the compound effect for the effect of this first tone. Even when several parts proceed in the same rhythm in polyphonic music, the mode in which the tones of the different instruments and voices commence, the nature of their increase in force, the certainty with which they are held and the manner in which they die off, are generally slightly different for each. ... When a compound tone commences to sound, all its partial tones commence with the same comparative strength; when it swells, all of them generally swell uniformly; when it ceases, all cease simultaneously. Hence no opportunity is generally given for hearing them separately and independently. (pp. 59–60)

A. Effects of Harmonicity
Musical instrument tones provide us with many examples of perceptual grouping by harmonicity. String and wind instruments produce tones whose partials are harmonic, or close to harmonic, and these give rise to strongly fused pitch impressions. In contrast, bells and gongs produce tones whose partials are nonharmonic, and these give rise to diffuse pitch impressions. The effect of harmonicity has been explored in numerous experiments using synthesized tones (Carlyon, 2004; Carlyon & Gockel, 2007; Darwin, 2005a; Darwin & Carlyon, 1995).

How far can a single component of a complex tone deviate from harmonicity and still be grouped with the other components to determine perceived pitch? Moore, Glasberg, and Peters (1985) had subjects judge the pitches of harmonic complex tones, and they examined the effects of mistuning one of the harmonics to various extents. When the harmonic was mistuned by less than 3%, it contributed fully to the pitch of the complex. As the degree of mistuning increased beyond 3%, the contribution made by this component gradually decreased, and at a mistuning of 8%, the component made virtually no contribution to the pitch of the complex.

The effect of a mistuned harmonic can, however, be made to vary by changing its relationship to the remainder of the complex (Darwin, 2005a). In one experiment, subjects were presented with a harmonic complex tone that contained one mistuned harmonic. When this harmonic was synchronous with the others, the perceived pitch of the complex was slightly shifted. However, when the mistuned harmonic entered sufficiently before the others, it no longer contributed to the pitch of the complex (see also Darwin & Ciocca, 1992; Ciocca & Darwin, 1999). Furthermore, when the complex was preceded by a sequence of four tones at the same frequency as the mistuned harmonic, the pitch shift again disappeared, indicating that the mistuned harmonic had formed a separate stream with the preceding tones. Also, when all the harmonics in the complex (including the mistuned one) were given a common vibrato, larger amounts of mistuning were needed to remove the contribution of the mistuned harmonic to the pitch of the complex, indicating that the common vibrato had caused the harmonics to be bound together more effectively (Darwin, Ciocca, & Sandell, 1994).

Huron (1991b, 2001) has related findings on harmonicity and spectral fusion to polyphonic music. One objective of such music is to maintain the perceptual independence of concurrent voices. In an analysis of a sample of polyphonic keyboard works by J. S. Bach, Huron showed that harmonic intervals were avoided in proportion to the strength with which they promoted tonal fusion. He concluded that Bach had used this strategy in order to optimize the salience of the individual voices in these compositions.

Composers have also focused on the creation of perceptual fusion of simultaneous tones so as to give rise to unique timbres. For example, at the opening of Schubert's Unfinished Symphony, the oboe and clarinet play in unison, with the result (assuming the performers play in strict synchrony) that listeners hear a fused sound with a unique timbre that appears to be emanating from a single instrument. More recently, composers have frequently experimented with sounds produced by

several instruments playing simultaneously, such that the individual instruments lost their perceptual identities and together produced a single sound impression. For example, Debussy and Ravel made extensive use of chords that approached timbres. Later composers such as Schoenberg, Stravinsky, Webern, and Varèse often employed highly individualized structures, which Varèse termed "sound masses" (Erickson, 1975), and here tone combinations that stood in simple harmonic relation were particularly useful.

To return to laboratory experiments, a number of studies have found that simultaneous speech patterns could be more easily separated out perceptually when they were built on different fundamentals; in general, the amount of useful perceptual separation reached its maximum when the fundamentals differed by roughly one to three semitones (Assmann & Summerfield, 1990; Scheffers, 1983). Furthermore, formants built on the same fundamental tended to be grouped together so as to produce a single phonetic percept, whereas a formant built on a different fundamental tended to be perceived as distinct from the others (Broadbent & Ladefoged, 1957). The relationship of these findings to musical tones was explored by Sandell and Darwin (1996), who generated simultaneous tone pairs taken from five different orchestral instruments (flute, B♭ clarinet, cor anglais, French horn, and viola). The authors found that subjects were better able to separate and identify the tones when their pitches differed by a semitone.

The number of sources perceived by the listener provides a further measure of grouping. For example, Moore, Glasberg, and Peters (1986) found that when a single component of a harmonic complex was slightly mistuned from the others, it was heard as standing apart from them. Interestingly, less mistuning is required to produce the impression of multiple sources than to produce other effects. For example, a slightly mistuned harmonic in a sound complex might be heard as distinct from the others, yet still be grouped with them in determining perceived pitch (Moore et al., 1986) or vowel quality (Darwin, 1981; Gardner, Gaskill, & Darwin, 1989). As argued by Darwin and Carlyon (1995), this type of disparity indicates that perceptual grouping involves a number of different mechanisms, which depend on the attribute being evaluated, and these mechanisms do not necessarily employ the same criteria. This issue is discussed further in Section VI, where it is shown that, in listening to simultaneous sequences of tones, separate, and sometimes inconsistent, decision mechanisms are employed to determine the perceived pitch, location, loudness, and timbre of each tone, so that illusory conjunctions result.
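
The mistuned-harmonic stimuli discussed above are straightforward to approximate in software. The sketch below, written in Python with NumPy, synthesizes a harmonic complex in which one harmonic can be shifted away from its harmonic frequency by a given percentage; the fundamental, number of harmonics, and mistuning values are illustrative assumptions rather than the parameters used by Moore, Glasberg, and Peters (1985).

    import numpy as np

    def harmonic_complex(f0=200.0, n_harmonics=12, mistuned_harmonic=4,
                         mistuning_percent=0.0, dur=0.5, sr=44100):
        # Harmonic complex tone in which one harmonic may be mistuned by a
        # given percentage (all parameter values here are illustrative).
        t = np.arange(int(dur * sr)) / sr
        signal = np.zeros_like(t)
        for n in range(1, n_harmonics + 1):
            freq = n * f0
            if n == mistuned_harmonic:
                # Shift this partial away from n * f0 by the requested percentage.
                freq *= 1.0 + mistuning_percent / 100.0
            signal += np.sin(2 * np.pi * freq * t)
        return signal / n_harmonics  # simple normalization

    # Example: 3% mistuning of the 4th harmonic, roughly the region in which
    # the component's contribution to the pitch of the complex begins to decline.
    tone = harmonic_complex(mistuning_percent=3.0)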

B. Effects of Onset Synchronicity


So far we have been considering sounds whose components begin and end at the same time, and we have been exploring the spectral relationships that are conducive to perceptual fusion. In real musical situations, temporal factors also come into play. One such factor is onset synchronicity. The importance of this factor can be shown in a simple demonstration, in which a harmonic series is presented so that its components enter at different times. For example, we can take a series that

is built on a 200-Hz fundamental. Suppose we begin with the 200-Hz component sounding alone, then 1 s later add the 400-Hz component, then 1 s later add the 600-Hz component, and so on, until all the components are sounding together. As each component enters, its pitch is initially heard as forming a distinct entity, and then it gradually fades from perception, so that finally only a pitch that corresponds to the fundamental is perceived. Even a transient change in the amplitude of a component can enhance its perceptual salience. If a particular harmonic of a complex tone is alternately omitted and restored, this can cause it to stand out as a pure tone, separately audible from the remainder of the complex, and it can even be heard for a short time after being turned back on (Hartmann & Goupell, 2006; Houtsma, Rossing, & Wagenaars, 1987). Darwin and Ciocca (1992) have shown that onset asynchrony can influence the contribution made by a mistuned harmonic to the pitch of a complex. They found that a mistuned harmonic made less of a contribution to perceived pitch when it led the others by more than 80 ms, and it made no contribution when it led the others by 300 ms. Later, Ciocca and Darwin (1999) observed that a mistuned harmonic made a larger contribution to the pitch of a target sound when it occurred following the onset of the target than when it preceded its onset. Onset asynchrony can also affect the contribution of a harmonic to the perceived timbre of a complex. Darwin (1984) found that when a single harmonic of a vowel whose frequency was close to that of the first formant led the others by roughly 30 ms, there resulted an alteration in the way the formant was perceived; this alteration was similar to that which occurred when the harmonic was removed from the calculation of the formant. Interestingly, Darwin and colleagues found that the amount of onset asynchrony that was needed to alter the contribution of a harmonic to perceived pitch was greater than was needed to alter its contribution to perceived vowel quality. Hukin and Darwin (1995a) showed that this discrepancy could not be attributed to differences in signal parameters, but rather to the nature of the perceptual task in which the subject was engaged; again arguing, as did Darwin and Carlyon (1995), that such disparities reflect the operation of multiple decision mechanisms in the grouping process that can act independently of each other. Onset asynchrony has been shown to have higher level effects also. In one experiment, Bregman and Pinker (1978) presented subjects with a two-tone complex in alternation with a third tone, and they studied the effects of onset-offset asynchrony between the simultaneous tones. As the degree of onset asynchrony increased, the timbre of the complex tone was judged to be purer, and the probability increased that one of the tones in the complex would form a melodic stream with the third tone (see also Deutsch, 1979, discussed in Section VI,A). So far, we have been considering the effects of onset asynchrony on the grouping of components of single complex tones; however, asynchronies also influence the grouping of entire tone complexes. Specifically, when two complex tones are presented together, they are perceptually more distinct when their onsets are asynchronous. Rasch (1978) presented subjects with simultaneous pairs of complex

tones, and found that detection of the higher tone of a pair was strongly affected by onset asynchrony: Each 10 ms of delay of the lower tone was associated with roughly a 10-dB reduction in detection threshold, and at a delay of 30 ms, the threshold for perception of the higher tone was roughly the same as when it was presented alone. Further, when the onsets of the higher and lower tones were synchronous, a single fused sound was heard; yet when onset disparities were introduced, the tones sounded very distinct perceptually.

Rasch (1988) later applied these findings to live ensemble performances. He made recordings of three different trio ensembles (string, reed, and recorder) and calculated the onset relationships between tones that were nominally simultaneous. He obtained asynchrony values that ranged from 30 to 50 ms, with a mean asynchrony of 36 ms. Relating these findings to those he had obtained earlier on perception, Rasch concluded that such onset asynchronies enabled the listener to hear the simultaneous tones as distinct from each other. According to this line of reasoning, such asynchronies should not be considered as performance failures, but rather as characteristics that are useful in enabling listeners to hear concurrent voices distinctly.

From these findings, one would expect large amounts of asynchrony to be conducive to the separation of voices in an ensemble. One might therefore hypothesize that compositional practice would exploit this effect, at least in polyphonic music, where it is intended that the individual voices should be distinctly heard. Evidence for this hypothesis was obtained by Huron (1993, 2001) in an analysis of Bach's 15 two-part inventions. He found (controlling for duration, rhythmic order, and meter) that for 11 of the inventions, no other permutations of the rhythms of the voices would have produced more onset asynchrony than occurred in Bach's actual music. For the remaining 4 inventions, values of asynchrony were still significantly higher than would be expected from chance. Huron concluded that Bach had deliberately produced such onset asynchronies so as to optimize the perceptual salience of the individual voices in these compositions.
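
A minimal sketch of the staggered-onset demonstration described at the start of this section is given below; it is written in Python with NumPy, and the 1-s entry interval, number of harmonics, and overall duration are illustrative assumptions rather than values taken from a published experiment.

    import numpy as np

    def staggered_harmonics(f0=200.0, n_harmonics=6, entry_interval=1.0,
                            tail=2.0, sr=44100):
        # Build a harmonic series on f0 in which each successive harmonic enters
        # entry_interval seconds after the previous one, so that its pitch is
        # heard briefly as a distinct entity before it fuses into the complex.
        total_dur = entry_interval * (n_harmonics - 1) + tail
        t = np.arange(int(total_dur * sr)) / sr
        signal = np.zeros_like(t)
        for n in range(1, n_harmonics + 1):
            onset = (n - 1) * entry_interval
            active = t >= onset  # harmonic n sounds from its onset to the end
            signal[active] += np.sin(2 * np.pi * n * f0 * t[active])
        return signal / n_harmonics

    demo = staggered_harmonics()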

C. Auditory Continuity
Auditory continuity is a dramatic effect that can result from temporal disparities within tone complexes. This effect is important to the interpretation of our natural environment, where sound signals are frequently masked by other sounds. In order to maintain a stable representation of the auditory world, our perceptual system needs to restore the masked portions of each sound, by estimating their characteristics based on other sounds that occur before, during, and after the masking sound. The continuity effect is widespread, and has been shown to occur in nonhuman species such as cats (Sugita, 1997), monkeys (Petkov, O'Connor, & Sutter, 2003), and birds (Braaten & Leary, 1999; Seeba & Klump, 2009), as well as in human listeners (Houtgast, 1972; Miller & Licklider, 1950; Vicario, 1960; Warren, Obusek, & Ackroff, 1972). Consider the visual analogue shown in the upper portion of Figure 2.

Figure 2 Visual analogue of an auditory continuity effect. Line A in the upper illustration could, in principle, be seen as having three components: a line to the left of the rectangle, a line to its right, and a line that forms part of the rectangle itself. However, it is instead seen as a single, continuous line. This effect is weaker in the lower illustration, in which the rectangle is wider, and the lines to its left and right are shorter. Adapted from Vicario (1982).

Line A could, in principle, be viewed in terms of three components: a line to the left of the rectangle, a line to its right, and a line that forms part of the rectangle itself. However, our visual system instead treats all three components as a single line, which is independent of the remaining parts of the rectangle.

Vicario (1982) produced a musical equivalent of this demonstration. He generated a chord that consisted of components corresponding to C4, D♯4, F♯4, A4, C5, D♯5, and F♯5, with A4 both preceding and following the other components. Just as line A in Figure 2 is seen as continuing through the rectangle, so a pitch corresponding to A4 is heard as continuing right through the chord.

The continuity effect is sensitive to the precise temporal parameters of the various components. To return to Vicario's visual analogue, when the lines forming the rectangle are lengthened and the lines to its left and right are shortened, as in the lower portion of Figure 2, the impression of continuity is reduced. Similarly, when the duration of the lengthened component of the chord is reduced, and the duration of the full chord is increased, the impression of continuity is diminished.

An interesting demonstration of auditory continuity was provided by Dannenbring (1976), who generated a pure-tone glide that rose and fell repeatedly. In some conditions, the glide was periodically interrupted by a loud broadband noise; however, it was perceived as though continuous. In contrast, when the glide was periodically broken, leaving only silent intervals during the breaks, listeners instead heard a disjunct series of rising and falling glides. Visual analogues of these two conditions, and their perceptual consequences, are shown in Figure 3.

Sudden amplitude drops between signals and intervening noise bursts may reduce, or even destroy, continuity effects (Bregman & Dannenbring, 1977; Warren et al., 1972); however, this does not necessarily occur.

Figure 3 Visual illustration of an auditory continuity effect using gliding tones. Adapted from Bregman (1990), which illustrates an experiment by Dannenbring (1976).

For example, tones produced by plucked instruments are characterized by rapid increases followed by decreases in amplitude. In music played by such instruments, when the same tone is rapidly repeated many times, and is periodically omitted and replaced by a different tone, the listener may perceptually generate the missing tone. Many examples of this phenomenon occur in 19th- and 20th-century guitar music, such as Tárrega's Recuerdos de la Alhambra (Figure 4) and Barrios's Una Limosna por el Amor de Dios. Here the strong expectations set up by the rapidly repeating notes cause the listener to hear these notes even when they are not being played. Interestingly, at the end of the Barrios piece, the tempo is gradually slowed down, so that the gaps in the repeating presentations become apparent. In this way, the listener is drawn to realize that the gaps had in fact been there, although imperceptibly, throughout the work.

Remijn, Nakajima, and Tanaka (2007) explored auditory continuity across a silent interval from a different perspective. They presented subjects with a pattern consisting of two crossing frequency glides of unequal duration that shared a silent gap of 40 ms or less at the crossing point. The gap was perceived to occur only in the shorter glide, while the longer glide was perceived as continuous.

Both long- and short-term memory can influence the strength and nature of the auditory continuity effect (Vicario, 1973; Warren, 1983). In one experiment, Sasaki (1980) generated melodic patterns in which certain tones were omitted and replaced by loud noise bursts. Under some circumstances, subjects heard the missing tone appear through the noise. This percept was most likely to occur when the omitted tone was predictable from the musical context; for example, when it formed part of a well-known melody. A short-term context effect was demonstrated by Ciocca and Bregman (1987), who presented subjects with a gliding tone that was interrupted by a noise burst. When the entering and exiting portions of the glide fell either in the same frequency range or on a common trajectory, subjects tended to hear the glide as continuing through the noise. Later, Tougas and Bregman (1990) generated two simultaneous glides, one ascending and the other descending, with the two crossing in the middle. Previous studies had shown that global frequency proximity strongly

Figure 4 The beginning of Recuerdos de la Alhambra, by Tárrega. Although the tones are presented one at a time, two parallel lines are perceived, organized in accordance with pitch proximity. Adapted from Deutsch (1996).

influenced how crossing pitch patterns were perceived (Deutsch, 1975a, 1975b; Tougas & Bregman, 1985; Van Noorden, 1975; see also Section VI). As expected from these findings, Tougas and Bregman (1990) observed that frequency proximity dominated over trajectory in determining the type of perceptual restoration that was obtained: Subjects tended to hear a higher glide that fell and then rose again, together with a lower glide that rose and then fell again, with the two meeting in the middle. Continuity effects can be influenced by more complex factors. In one experiment, Darwin (2005b) had subjects make judgments on complex tones that alternated with band-pass noise. He found that a quiet complex tone was heard as continuous when all its harmonics fell within the frequency range of the noise band. This impression of continuity was substantially reduced when harmonics were added that were outside the range of the noise; however, it was largely restored when the additional components produced a different fundamental frequency. Darwin concluded that continuity judgments are made on entire simultaneously grouped objects, rather than being determined by local frequency criteria (see also McDermott & Oxenham, 2008). In other experiments, Riecke, Mendelsohn, Schreiner, and Formisano (2009) demonstrated that continuity illusions can be influenced by preceding sound patterns. Specifically, they found that whether or not the same perceptually ambiguous

glide was heard as continuous could be modulated by the loudness and perceived continuity of preceding glides. These context effects were related less to the spectra of the preceding sounds than to how they had been interpreted by the listener.

The brain mechanisms underlying the continuity illusion have also been explored. Petkov, O'Connor, and Sutter (2007) studied responses in the auditory cortex of macaque monkeys to tones that were interrupted by a loud noise. They found that some neurons responded to discontinuous tones that were interspersed with noise as though the tones were continuous (see also Petkov & Sutter, 2011).
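
As a rough illustration of the kind of stimulus used in the glide-continuity studies described earlier in this section (e.g., Dannenbring, 1976), the sketch below builds a repeatedly rising and falling pure-tone glide and either interrupts it with broadband noise or leaves silent gaps. It is written in Python with NumPy, and all parameter values are illustrative assumptions rather than those of the original experiments.

    import numpy as np

    def glide_with_interruptions(f_low=500.0, f_high=1500.0, sweep_dur=0.5,
                                 n_sweeps=6, gap_dur=0.15, fill="noise", sr=44100):
        # Alternately rising and falling pure-tone glides, each followed by an
        # interruption that is either a loud broadband noise burst or silence.
        segments = []
        rng = np.random.default_rng(0)
        for i in range(n_sweeps):
            t = np.arange(int(sweep_dur * sr)) / sr
            f_start, f_end = (f_low, f_high) if i % 2 == 0 else (f_high, f_low)
            # Linear frequency sweep: integrate instantaneous frequency for the phase.
            freq = f_start + (f_end - f_start) * t / sweep_dur
            phase = 2 * np.pi * np.cumsum(freq) / sr
            segments.append(0.5 * np.sin(phase))
            gap = np.zeros(int(gap_dur * sr))
            if fill == "noise":
                gap = rng.normal(0, 0.5, gap.size)  # loud broadband noise burst
            segments.append(gap)
        return np.concatenate(segments)

    continuous_sounding = glide_with_interruptions(fill="noise")
    disjunct_sounding = glide_with_interruptions(fill="silence")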

D. Effects of Vibrato
Natural sustained sounds, such as those generated by musical instruments and the singing voice, constantly undergo small frequency fluctuations that preserve the ratios formed by their components. It has been surmised that the auditory system uses such coherent frequency modulation (FM) as a cue for grouping spectral components together, and conversely uses incoherent FM as a cue for separating them out perceptually (Bregman, 1990). Indeed, composers such as Chowning (1980) and McNabb (1981) have produced informal demonstrations that coherent vibrato enhances perceptual fusion when imposed on synthesized singing voices or musical instrument tones. Later, Darwin, Ciocca, and Sandell (1994) found that a mistuned harmonic was more likely to contribute to the pitch of a complex tone when it was given a common vibrato with the other harmonics.

The issue with respect to incoherent FM, however, is theoretically complex: Because information concerning FM is severely degraded in reverberant environments, the reliance on incoherent FM as a cue for perceptual separation could cause the listener to separate out components that should instead be grouped together. Furthermore, incoherent FM necessarily causes the frequency relationships between components of a tone to depart from harmonicity. Because the perceptual system already uses such departures as cues for perceptual segregation (as discussed earlier), the utility of incoherent FM as an additional cue is debatable.

The experimental evidence on this issue is also complex. McAdams (1989) explored the effect of vibrato on the perceptual separation of three simultaneous sung vowels which were built on different fundamentals. He found that when target vowels were given a vibrato, this increased their perceptual salience. However, the perceived salience of target vowels was not affected by whether the nontarget vowels were given a vibrato. Other negative findings were obtained by Carlyon (1991, 1992), who found that subjects were insensitive to incoherent vibrato when it was independent of departures from harmonicity. When the components of tones stood in nonharmonic relation, listeners were unable to judge whether they were modulated coherently or incoherently with each other.

Such negative findings raise the question of why vibrato can nevertheless enhance a vowel's perceptual salience. McAdams (1984) pointed out that when the harmonics of a vowel are given a vibrato, they also undergo amplitude modulation (AM) that traces the vowel's spectral envelope. In this way, the listener is provided with more

complete information concerning the vowel's identity, and such spectral tracing might therefore be responsible for the enhanced perceptual salience of vowels with vibrato. However, Marin and McAdams (1991) found that although vowels with vibrato were heard more saliently, spectral tracing was not a factor here. As an alternative explanation for the enhanced prominence of frequency-modulated vowels, we may conjecture that neural units involved in the attribution of vowel quality are more strongly activated by frequency-modulated sounds than by unmodulated ones.
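
To illustrate what coherent vibrato means in signal terms, the following sketch (Python with NumPy; the modulation rate, depth, and harmonic count are illustrative assumptions, not values from the studies cited above) applies a common frequency modulation to every harmonic of a complex tone, so that the frequency ratios between the partials are preserved throughout the modulation cycle.

    import numpy as np

    def complex_with_common_vibrato(f0=200.0, n_harmonics=8, vib_rate=6.0,
                                    vib_depth=0.02, dur=1.0, sr=44100):
        # Harmonic complex whose partials all follow the same proportional
        # frequency modulation (coherent vibrato), keeping them harmonically related.
        t = np.arange(int(dur * sr)) / sr
        # Common modulator: instantaneous F0 varies by +/- vib_depth (here 2%).
        inst_f0 = f0 * (1.0 + vib_depth * np.sin(2 * np.pi * vib_rate * t))
        phase_f0 = 2 * np.pi * np.cumsum(inst_f0) / sr
        signal = np.zeros_like(t)
        for n in range(1, n_harmonics + 1):
            # Each harmonic's phase is n times the fundamental's phase,
            # so all partials move up and down together.
            signal += np.sin(n * phase_f0)
        return signal / n_harmonics

    vibrato_tone = complex_with_common_vibrato()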

E. Effects of Amplitude Modulation


Because many natural sounds consist of spectral components whose amplitudes rise and fall in synchrony, one might conjecture that coherent AM would be used by the auditory system as a cue for perceptual fusion, and incoherent AM would be used as a cue for perceptual separation. On the other hand, the partials of many musical instrument tones do not rise and fall in synchrony. So the use of AM incoherence as a cue for perceptual separation could cause the listener to erroneously separate out components that should be perceptually fused together. The experimental evidence on this issue is also equivocal. Bregman, Abramson, Doehring, and Darwin (1985) concluded that coherent AM could promote perceptual fusion; however, the modulation rates they used were high, and at slower rates, convincing evidence that coherent AM leads to perceptual fusion has been difficult to obtain (Darwin & Carlyon, 1995).

F. Effects of Spatial Separation


Because all the components of a sound necessarily originate from a common location, and the components of different sounds originate from different locations, one might expect that the spatial origins of components would strongly influence how they are perceptually grouped together. The issue arises, however, of how the spatial origin of a component should be inferred in the first place. In natural environments, sound waves are subject to numerous distortions as they travel from their sources to our ears. So if we were to rely on first-order localization cues alone (such as differences in amplitude and phase at the two ears), we would risk separating out components that should instead be combined perceptually. Given this line of reasoning, we might expect the auditory system not to use first-order localization cues as primary bases for grouping, but instead to use them only when other supporting cues are present. Indeed, we can go further and hypothesize that factors such as harmonicity and onset synchronicity, which indicate that components have originated from a common source, might cause us to hear these components as arising from the same spatial location (see also Section VI). Experimental evidence supporting this view has been obtained from studies in which different components of a sound complex were presented to each ear. Beerends and Houtsma (1989) had subjects identify the pitches of two complex tones, when their partials were distributed across ears in various ways. They found that pitch identification was only weakly affected by the way the partials were

distributed, showing that the perceptual system was treating them as coming from the same source. A related effect was found by Darwin and Ciocca (1992), who observed that the contribution of a single mistuned harmonic to the pitch of a complex tone was almost as large when this harmonic was delivered to the opposite ear as when it was delivered to the same ear as the other harmonics. Related effects have been found for the perception of speech sounds. Broadbent and Ladefoged (1957) presented listeners with the first two formants of a phrase, with one formant delivered to each ear. When the two formants were built on the same fundamental, the listeners were able to identify the speech signal, and they also tended to hear a single voice, and so were fusing the information from the two ears into a single perceptual image. Later, Hukin and Darwin (1995b) investigated the degree to which a single component contributed to the perceived quality of a vowel when it was presented to the ear opposite the remaining components, and found that this difference in ear of input had only a small effect on perceived vowel quality. Support has also been obtained for the conjecture that other grouping cues, such as harmonicity and onset asynchrony, can influence the perceived spatial origin of a component of a sound complex. For example, Hill and Darwin (1993) found that mistuning a harmonic in a sound complex caused its apparent location to be determined separately from the remainder of the complex. We shall see in Section VI that when two sequences of tones are presented simultaneously, one to each ear, a number of factors influence whether or not ear of input is used as a localization cue, and that these factors also influence the perceived spatial origins of the individual tones in a complex.

G. Effects of Statistical Regularities


Listeners are able to form groupings from repeating spectrotemporal structures that are embedded in changing acoustic input, even in the absence of strong cues such as harmonicity, onset synchronicity, and so on. As a result, we can identify complex sounds that are present in a mixture, provided that they occur repeatedly. McDermott, Wrobleski, and Oxenham (2011) synthesized novel sounds that shared some of the structure of natural sounds, but lacked strong grouping cues. In general, subjects were unable to identify such a sound when it was presented in a single mixture. However, when a series of sound mixtures was presented, each containing the same target sound, subjects were able to identify the target reliably; a significant benefit was even obtained from only two presentations of the target sound mixed in with others.

III. Larger-Scale Groupings

So far, we have been focusing on situations in which single tone complexes are presented, and we have identified various cues that are used by the listener to sort their

components into groupings. We now turn to the situation in which a sequence of tones is presented instead. Here the auditory system abstracts relationships between tones in a sequence, and uses these relationships as additional grouping cues. One cue that we use here is pitch proximity: We tend to form sequential linkages between tones that are close in pitch and to separate out those that are further apart. Researchers have frequently drawn an analogy with apparent motion in vision: When two lights that are in spatial proximity are flashed on and off in rapid succession, we obtain the illusion that a single light has moved from one location to the other. A second cue is temporal proximity: When pauses are placed between tones in a sequence, we use these as cues for grouping the tones into subsequences. A third cue is similarity of sound quality: When different types of instrument play together, we tend to form linkages between tones that are similar in timbre. We also invoke other principles, such as good continuation and common fate, in making grouping decisions. In addition, high-level factors are involved, such as memory, attention, and the processing of information at high levels of abstraction. We first explore the separation of a sequence of single tones into different groupings. Two issues are considered. The first, which has been the subject of considerable research, concerns the ways in which we group a sequence of tones into separate and parallel streams in a polyphonic texture. The second issue, which is of considerable importance but has been the subject of less research, is how we divide a musical stream into coherent segments that are separated by temporal boundaries (see also Chapter 7). Finally, we explore the grouping of simultaneous sequences of tones.

IV. Auditory Streaming and Implied Polyphony

A. Streaming by Pitch Proximity


In general, when a sequence of tones is presented at a rapid tempo, and the tones are drawn from two different pitch ranges, the listener perceives two melodic lines in parallel, one corresponding to the higher tones and the other to the lower ones. The separation of rapid sequences into different streams is widespread, and occurs in human infants (Demany, 1982; McAdams & Bertoncini, 1997; Winkler et al., 2003) and in nonhuman species, such as monkeys (Izumi, 2002), birds (MacDougall-Shackleton, Hulse, Gentner, & White, 1998), and fish (Fay, 1998).

Auditory streaming is frequently exploited in music using the technique of pseudo-polyphony, or compound melodic line. Baroque composers such as Bach and Telemann frequently employed this technique, particularly in works for solo recorder or strings. In more recent times, the technique has been employed to striking effect by composers of classical and romantic guitar music, in which the notes produced by plucked strings occur at a very rapid tempo. The passage from Tárrega's Recuerdos de la Alhambra shown in Figure 4 provides an example. In this figure, the passage is also represented with pitch and time mapped into the vertical and

horizontal dimensions of visual space. It can be seen that two separate lines emerge in the visual representation, and these correspond to the two melodic lines that are perceived by the listener.

This phenomenon of perceptual dissociation has been investigated in a number of studies. Miller and Heise (1950) presented subjects with two pure tones at different frequencies (A and B) such that they alternated at a rate of 10 per second, forming an ABAB pattern (Figure 5). When the pitch difference between the tones was small (about one or two semitones), subjects heard the sequence as a trill. However, when this pitch difference was large, subjects instead heard the sequence as two interrupted and unrelated tones.

Figure 5 Patterns most frequently employed to study the perceptual segregation of a rapid sequence of tones. The patterns either alternate between the two pitches A and B (forming an ABAB pattern) or they consist of repeating triplets (forming an ABA_ triplet pattern). When the pitch distance between tones A and B is small, a single stream of related tones is perceived. When the pitch distance is large, with continued listening two unrelated pitch streams are perceived, one consisting of the low tones and the other of the high tones.
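
For readers who want to hear the patterns in Figure 5 for themselves, the sketch below generates an alternating two-tone sequence of the kind used by Miller and Heise (1950) and in later streaming studies. It is written in Python with NumPy, and the base frequency, tone duration, and level settings are illustrative assumptions rather than parameters from any particular experiment.

    import numpy as np

    def alternating_tones(f_a=500.0, separation_semitones=2.0, tone_dur=0.1,
                          n_repeats=10, pattern="ABAB", sr=44100):
        # Repeating ABAB or ABA_ sequence of pure tones A and B. Small pitch
        # separations tend to be heard as a single stream (a trill); large
        # separations tend to split into two streams.
        f_b = f_a * 2 ** (separation_semitones / 12.0)  # B lies N semitones above A
        t = np.arange(int(tone_dur * sr)) / sr
        tone = {"A": 0.5 * np.sin(2 * np.pi * f_a * t),
                "B": 0.5 * np.sin(2 * np.pi * f_b * t),
                "_": np.zeros_like(t)}  # "_" is the silent slot in the ABA_ triplet
        one_cycle = np.concatenate([tone[symbol] for symbol in pattern])
        return np.tile(one_cycle, n_repeats)

    trill = alternating_tones(separation_semitones=2.0)    # tends to cohere
    split = alternating_tones(separation_semitones=12.0)   # tends to segregate
    gallop = alternating_tones(separation_semitones=2.0, pattern="ABA_")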

In a further experiment, Heise and Miller (1951) employed rapid sequences of tones that were composed of several different pitches. When one of the tones in a sequence differed sufficiently in pitch from the others, it was heard in isolation from them.

Related phenomena have been demonstrated by Dowling and colleagues (Dowling, 1973b; Dowling, Lung, & Herrbold, 1987). In one paradigm, two well-known melodies were presented at a rapid tempo, such that the tones from each melody occurred in alternation. The melodies were difficult to identify when their pitch ranges overlapped closely, but easy to identify when their pitch ranges differed. In another paradigm, Dowling presented an unfamiliar target melody, followed by a probe melody that was interleaved with a distractor sequence. Subjects judged whether the two melodies were the same or different, and their judgments improved with increasing pitch separation between the probe melody and the distractor tones.

Considering Dowling's second paradigm, Bey and McAdams (2003) studied the effects of interleaving the distractor tones with either the target or the probe melody. In one condition, the target melody was first presented mixed with distractor tones, and this was followed by a probe melody that was presented alone (Figure 6). In a second condition, the target melody was first presented alone, and this was followed by the probe melody mixed with distractors. In the third condition, the target was again first presented alone, but this time in transposed form. In a control condition, both the target and the probe melodies were presented alone. As expected from previous findings, in all conditions, performance improved with increasing frequency separation between the target and the distractors. However, performance was enhanced overall when the target was first presented alone, and
Figure 6 Examples of patterns employed to study perception of tone sequences that were interleaved with distractor tones. In the experimental condition depicted here, the target melody was presented interleaved with distractors, and in the control condition, the target and probe melodies were both presented alone. In both conditions, the target melody was followed by a probe melody, and subjects judged whether or not the probe differed from the target. Adapted from Bey and McAdams (2003).

the probe was interleaved with distractors. In this latter condition, the subjects were better able to encode the sequence in memory before judging the mixed sequence. Interestingly, this performance enhancement was not as large when the target was presented in transposed form, indicating that absolute pitch level was also involved in the process. Van Noorden (1975) carried out a detailed study on the influence of pitch proximity and tempo in the building of perceptual streams. Subjects were presented with sequences consisting of two tones in alternation, and they attempted to hear either a single coherent stream or two separate streams. Two boundaries were determined by these means. The first was the threshold frequency separation as a function of tempo that was needed for the listener to hear a single stream. The second was the threshold frequency separation as a function of tempo when the listener was attempting to hear two streams. As shown in Figure 7, when the subjects were attempting to hear a single stream, decreasing the tempo from 50 to 150 ms per tone increased the range of frequency separation within which one stream could be heard from 4 to 13 semitones. However, when the subjects were instead attempting to hear two streams, decreasing the tempo had little effect on performance. Between these two boundaries, there was a large region in which the listener could alter his listening strategy at will, and so hear either one or two streams. So within this region, attention played a role in determining how the sequence was perceived. The preceding experiments employed either pure tones or harmonic complex tones in which pitch (fundamental frequency) and spectrum were co-varied. We can then ask whether differences in pitch with spectrum held constant, or differences in spectrum with pitch held constant, can alone give rise to streaming. Singh (1987) and Bregman, Liao, and Levitan (1990) explored the streaming of sequences in which tones differed either in spectral envelope, or fundamental frequency, or both; they found that both these factors contributed to stream segregation. Considering spectral region alone, Van Noorden (1975) found that listeners perceptually segregated sequences consisting of a pure tone alternating with a complex tone with the same fundamental frequency, or of two complex tones with

Figure 7 Temporal coherence boundary (o), and fission boundary (x) as a function of the frequency relationship between the alternating tones and presentation rate. Adapted from Van Noorden (1975).

the same fundamental frequency but different harmonics, showing that spectral region here played a role. Considering fundamental frequency alone, Vliegen and Oxenham (1999) employed sequences of tones consisting only of high harmonics, with spectral envelope held constant. Subjects segregated these sequences on the basis of fundamental frequency, and the amount of segregation was comparable to that found for pure tones. However, Vliegen, Moore, and Oxenham (1999) observed that spectral differences also contributed to stream segregation (see also Carlyon & Gockel, 2007; Grimault, Micheyl, Carlyon, Arthaud, & Collett, 2000). The effect of spectrum is an example of streaming by timbre, which is explored in the next section.

B. Streaming by Timbre
The grouping of sounds on the basis of sound quality, or timbre, is an example of the Gestalt principle of similarity: Just as we perceive the array in Figure 1b as four columns, two formed by the filled circles and two by the unfilled ones, so we group together tones that are similar in timbre and separate out those that are dissimilar. As a result, when different instruments play in parallel, we may form groupings based on their timbres even when their pitch ranges overlap heavily. An example is given in Figure 8, which is taken from Beethoven's Spring Sonata for violin and piano. Here the listener perceives two melodic lines that correspond to the tones played by each instrument, rather than linking the tones in accordance with pitch proximity.

A striking consequence of this streaming tendency was demonstrated by Warren, Obusek, Farmer, and Warren (1969). These authors generated a sequence of four unrelated sounds that were presented repeatedly without pause. The sounds, each 200 ms in duration, consisted of a high tone, a hiss (noise burst), a low tone, and a buzz (square wave). At this presentation rate, subjects were unable to name the orders in which the sounds occurred. For correct ordering to be achieved, the duration of each sound needed to be longer than 500 ms.

Another consequence of streaming by timbre was demonstrated by Wessel (1979). He presented subjects with a repeating pattern consisting of a three-tone ascending pitch line, with successive tones composed of alternating timbres that were defined by their spectral energy distribution. When the timbral difference
Figure 8 Passage from the beginning of the second movement of Beethoven's Spring Sonata for violin and piano. The tones played by the two instruments overlap in pitch; however, the listener perceives two melodic lines in parallel, which correspond to those played by each instrument. This reflects perceptual grouping by similarity.

between successive tones was small, listeners heard the pattern as composed of ascending lines. However, when the timbral difference was large, listeners linked the tones together on the basis of timbre, and so heard two interwoven descending lines instead. In a related experiment, Cusack and Roberts (2000) employed an interleaved melodies task, and found that target sounds were more easily separated from distractors when they differed in timbre. Since timbre is multidimensional in nature (see McAdams, Chapter 2), we can ask which of its aspects are most conducive to streaming. Iverson (1995) presented subjects with sequences of orchestral tones that were equated for pitch and loudness, and they rated how strongly the tones were perceptually segregated from each other. Multidimensional scaling analyses of their judgments indicated that both static and dynamic timbral attributes were involved. In a further experiment, Iverson presented subjects with interleaved melodies, and had them attempt to recognize the melodies on the basis of a target timbre. Again, judgments were influenced by both static and dynamic timbral attributes. Attack time was found to be influential in two ways: tones with rapid attacks segregated from each other more strongly, as did tones with contrasting attack times. Later, Bey and McAdams (2003) had subjects listen to a target melody that was interleaved with a distractor sequence, followed by a probe melody that they compared with the target. Synthesized instrument sounds were employed, and these had earlier been organized perceptually in terms of distance along a multidimensional timbral space. The tones had been found to vary along three dimensions: The first dimension related to spectral centroid, the second to attack quality, and the third to variations or irregularities in the spectral envelope. Melody identification improved with increasing distance between the target and distractor sequence in this multidimensional space. In a further study, Cusack and Roberts (2004) explored the effect on streaming of dynamic variations in the frequency spectrum. They generated periodic tones that differed in fundamental frequency, with the total amount of spectral flux held constant. Sequential patterns of tones were more likely to segregate perceptually when they contained different patterns of spectral variation, particularly variation in frequency centroid over time. The issue of timbre as a grouping cue is revisited in Section VI.
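
The Wessel (1979) demonstration described above lends itself to a simple synthetic sketch: a repeating three-tone ascending line in which odd- and even-numbered tones are given contrasting spectra, so that listeners may regroup the tones by timbre into two interleaved lines. The version below is written in Python with NumPy; the specific frequencies, the use of brighter versus duller harmonic spectra as the timbral contrast, and the timing are illustrative assumptions rather than Wessel's original synthesis parameters.

    import numpy as np

    def spectral_tone(freq, brightness, dur=0.12, sr=44100, n_harmonics=8):
        # Synthetic tone whose harmonic amplitudes fall off more slowly when
        # `brightness` is high, giving a brighter spectral energy distribution.
        t = np.arange(int(dur * sr)) / sr
        amps = np.array([1.0 / (n ** (2.0 - brightness)) for n in range(1, n_harmonics + 1)])
        partials = [a * np.sin(2 * np.pi * n * freq * t)
                    for n, a in zip(range(1, n_harmonics + 1), amps)]
        return np.sum(partials, axis=0) / np.sum(amps)

    def wessel_like_pattern(freqs=(400.0, 500.0, 630.0), n_repeats=8,
                            timbre_contrast=1.0):
        # Repeating three-tone ascending line with alternating timbres. With
        # little timbral contrast the ascending lines tend to be heard; with
        # strong contrast the tones may regroup by timbre instead.
        tones = []
        for i in range(n_repeats * len(freqs)):
            freq = freqs[i % len(freqs)]
            brightness = timbre_contrast if i % 2 == 0 else 0.0
            tones.append(spectral_tone(freq, brightness))
        return np.concatenate(tones)

    demo = wessel_like_pattern(timbre_contrast=1.5)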

C. Building and Resetting of Auditory Streams


A number of studies have shown that the splitting of tone sequences into separate streams based on pitch builds with repetition. Van Noorden (1975) compared the signal parameters that were required for listeners to hear a single connected series of tones, using sequences of different types. Compared with two-tone sequences, bidirectional three-tone sequences needed to be presented at a considerably slower tempo in order for a connected series to be perceived. For long repetitive sequences, the tempo needed to be slower still (Figure 9). Other studies have confirmed that the percept of two streams rather than one builds over time. Anstis and Saida (1985) found that stream segregation built

Figure 9 Temporal coherence boundary for two-tone (Curve 3), three-tone unidirectional (Curve 1), three-tone bidirectional (Curve 2), and continuous (Curve 4) sequences. Adapted from Van Noorden (1975).

steeply during the first 10 s of sequence presentation, continued to build at a decreased rate thereafter, and even appeared incomplete when the sequence had continued for 60 s. Roberts, Glasberg, and Moore (2002) also found that stream segregation built rapidly during the first few seconds, and this was followed by a more gradual buildup that appeared incomplete even after 25–30 s.

The streaming process can be reset by various manipulations. Bregman (1978) presented listeners with a repeating sequence that consisted of two high tones together with a single low tone. When the sequence split perceptually into two streams, listeners perceived two high tones in alternation, together with a single low tone that was steadily repeated. The number of tones that were packaged between 4-s periods of silence was varied, and listeners adjusted the speed of the sequence until the point of splitting was determined. As shown in Figure 10, as the number of tones in the package increased, the tempo required for perception of separate streams decreased. Beauvois and Meddis (1997) explored this issue further by having the subjects listen to an induction sequence of repeating tones (AAAA...) that were designed to produce the tendency to hear an A stream (see Figure 5). They then presented a silent interval, followed by a short ABAB test sequence. The tendency to hear the sequence as segregated into A and B streams decreased with increasing duration of the silent interval (see also Snyder, Carter, Lee, Hannon, & Alain, 2008).

The preceding findings lead to the further conjecture that the streaming mechanism can be reset, not only by the interpolation of silent intervals, but also by other changes in the stimulus pattern. As a test of this conjecture, Anstis and Saida (1985) presented listeners with sequences of ABAB tones in one ear, so that stream segregation developed. They then switched the sequence to the other ear, and this produced a reduction in streaming. The authors concluded that the streaming mechanism was reset by the change in signal location, a conclusion that was later supported by Rogers and Bregman (1993, 1998), who also produced resetting by an abrupt increase in loudness (see also Roberts, Glasberg, & Moore, 2008).

Figure 10 Threshold for stream segregation as a function of number of tones per package. Two high tones were presented in alternation with a single low tone. Adapted from Bregman (1978).


D. Streaming and Perception of Temporal Relationships


One consequence of the formation of separate perceptual streams is that temporal relationships between elements of different streams become difficult to process. This has been shown in several ways. Bregman and Campbell (1971) presented a repeating sequence consisting of six tones: three from a high pitch range and three from a low one. When the tones occurred at a rate of 10 per second, it was difficult for listeners to perceive a pattern of high and low tones that was embedded in the sequence. In a related experiment, Dannenbring and Bregman (1976) alternated two tones at high speeds so that they formed separate perceptual streams, and found that the tones from the two streams appeared to be overlapping in time. Using a different paradigm, Van Noorden (1975) studied the detection of temporal displacement of a tone that alternated continuously with another tone of different frequency. Using a basic pattern that consisted of two tones at frequencies A and B, such that they formed repeating ABA_ triplets (Figure 5), he produced different values of temporal displacement between the A and B tones. As the tempo of the sequence increased, the threshold for detecting temporal displacement between tones A and B also increased. This rise in threshold was substantial when the tones were widely separated in frequency, but only slight when their frequencies were similar. Such deterioration in temporal processing was considerably larger for long repetitive sequences than for two-tone sequences, indicating that it was associated with the process of stream formation (Figure 11). Interestingly, impaired sensitivity to temporal relationships between alternating tones has also been found when harmonic complex tones formed segregated streams based on differences in either spectrum or fundamental frequency (Vliegen et al., 1999).

Figure 11 Just noticeable temporal displacement ΔT/T of the second tone of a two-tone sequence, and of one tone in a continuous sequence of alternating tones, as a function of tone interval I. Adapted from Van Noorden (1975).

Van Noorden (1975) showed that the loss of temporal information resulting from stream segregation can have profound effects on the way a sequence is perceived. In an intriguing sound demonstration, he presented listeners with a continuous ABA_ triplet pattern, and gradually altered the pitch relationship between the tones. When tones A and B were close in pitch, a clear galloping rhythm was heard, reflecting the temporal relationships between successive tones. However, as the pitch difference between the tones increased, the galloping rhythm disappeared, and two unrelated temporal patterns were heard instead, one formed of the A tones and the other of the B tones. Later, Roberts et al. (2008) found that when the streaming process was reset by a change in signal parameters, so that a single stream was again perceived, judgments of the relative timing of temporally adjacent tones of different frequency improved.

E. Streaming by Amplitude and Amplitude Modulation


Amplitude can act as a cue to streaming under some conditions. For example, Dowling (1973b) found that loudness differences increased the perceptual distinctiveness of interleaved melodies. Van Noorden (1975) studied the perception of sequences consisting of tones of identical frequency that alternated between two different amplitudes. A sequence was heard as a single coherent stream when the amplitude difference between the tones was smaller than 5 dB, but as two separate streams when this amplitude difference was larger. With very large amplitude differences, auditory continuity effects were produced, so that the softer tone was heard as continuing through the louder one. Grimault, Bacon, and Micheyl (2002) carried out a study to determine whether rate of amplitude modulation (AM) might serve as a basis for streaming in the absence of other cues. When subjects were presented with temporal sequences that

consisted of bursts of broadband noise at alternating AM rates, they perceived a single perceptual stream when the difference in AM rate was less than 0.75 octave, but two separate streams when this difference was greater than an octave.
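
A rough sketch of the kind of stimulus Grimault, Bacon, and Micheyl (2002) describe is given below, written in Python with NumPy: a sequence of broadband noise bursts whose amplitude-modulation rate alternates between two values. The burst duration, base modulation rate, and modulation depth are chosen here purely for illustration and are not the values used in the original study.

    import numpy as np

    def am_noise_sequence(rate_a=100.0, rate_octave_diff=1.0, burst_dur=0.1,
                          n_bursts=20, sr=44100):
        # Broadband noise bursts whose AM rate alternates between rate_a and a
        # rate rate_octave_diff octaves higher; larger rate differences tend to
        # be heard as two separate streams.
        rng = np.random.default_rng(0)
        rate_b = rate_a * 2 ** rate_octave_diff
        t = np.arange(int(burst_dur * sr)) / sr
        bursts = []
        for i in range(n_bursts):
            rate = rate_a if i % 2 == 0 else rate_b
            noise = rng.normal(0, 0.3, t.size)
            envelope = 0.5 * (1 + np.sin(2 * np.pi * rate * t))  # full-depth sinusoidal AM
            bursts.append(noise * envelope)
        return np.concatenate(bursts)

    one_stream = am_noise_sequence(rate_octave_diff=0.5)    # small AM-rate difference
    two_streams = am_noise_sequence(rate_octave_diff=1.25)  # more than an octave apart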

F. Effects of Short-Term Memory and Context on Streaming


The strength of stream segregation depends not only on the characteristics of the test sequence, but also on the short-term context in which it occurs. Rogers and Bregman (1993) presented listeners with ABA_ triplet patterns that were preceded by induction sequences of different types. They found that the streaming of a test sequence was enhanced when it was preceded by an induction sequence whose properties were similar to those of the test sequence, showing that short-term memory played a role here. As a further context effect, Snyder et al. (2008) found that the range of pitch relationships that had been presented on previous trials affected the stream segregation boundary. Subjects were presented with sequences consisting of ABA_ triplets that were defined by tones of different frequency. It was found that the larger the frequency difference between tones on the preceding trial (and even on trials before the preceding trial), the less the streaming on the current trial. Whether or not auditory streaming occurs can even be influenced by cues from a different modality. Rahne, Bockmann, Specht, and Sussman (2007) presented subjects with perceptually ambiguous tone sequences that could be perceived either as one or as two streams. Concomitant visual sequences that were designed to promote either an integrated or a segregated percept influenced the perceptual organization of the tone sequences in the direction of the visual cue.

G. Streaming and Attention


As described earlier, Van Noorden (1975) observed that when sequences were presented that consisted of two tones in alternation, there was a region of ambiguity within which the listener could direct his attention at will, and so hear either one or two streams. More recently, a number of studies have explored the influence of attention on stream segregation in detail. An experiment by Carlyon, Cusack, Foxton, and Robertson (2001) consisted of several conditions. In one condition, a sequence of ABA_ triplets was presented to the left ear, while no sounds were presented to the right ear. Subjects reported continuously how many streams they heard, and so necessarily attended to the presented sequence. In another condition, the same sequence was presented to the left ear. However, during the first 10 s, the subjects made judgments on a series of noise bursts that were simultaneously presented to the right ear; they then switched attention and made streaming judgments on the left ear sequence. In a further condition, the subjects received the same stimuli as in the two-task condition, but were asked to ignore the noise bursts and concentrate only on the left ear sequence. The buildup of streaming was considerably attenuated in the condition where the subjects attended to the noise bursts compared with the other

conditions, indicating that attention played an important role in the process (see also Carlyon, Plack, Fantini, & Cusack, 2003). It is possible, however, that in this experiment switching attention from the noise bursts to the tone sequence may have caused a resetting of the streaming process similar to that which occurs when other features of the sequence are abruptly changed (Anstis & Saida, 1985). Evidence for this view was provided by Cusack, Deeks, Aikman, and Carlyon (2004), who showed that interpolating silent gaps in the sequence to be judged had the same effect as attention switching in reducing the buildup of streaming. Studies involving physiological measures have also indicated that attention can modulate the streaming process. Snyder, Alain, and Picton (2006) presented subjects with repeating ABA_ triplets and recorded their event-related potentials (ERPs) while they either attended to the sequence or watched a silent movie during its presentation. Stream segregation developed when the subjects were attending to the sequence, and this correlated with ERP enhancements; however, the ERP effect was reduced when the subjects ignored the sequence. Elhalali, Xiang, Shamma, and Simon (2009) arrived at a similar conclusion from an experiment in which subjects were required to attend either to a repeating target tone that was surrounded by random maskers or to the background maskers themselves. Recordings using magnetoencephalography (MEG) showed that attention strongly enhanced the neural representation of the attended target in the direction of boosting foreground perception. Other studies using measures such as the mismatch negativity component of the ERP (an index of preattentive acoustic processing) have indicated that stream formation can occur even when subjects are performing a task that draws attention away from the presented sounds (cf. Sussman, Horvath, Winkler, & Orr, 2007; Sussman, Ritter, & Vaughan, 1999). It appears, therefore, that streaming can develop preattentively, though it can also be influenced by attention focusing. The preattentive streaming of sounds on the basis of complex attributes such as timbre, and the involvement of memory in the streaming process, are in accordance with the model of attention advanced by Deutsch and Deutsch (1963), which proposes that attention selection is determined by the analysis of information at very high levels of processing. The issue of attention selection in grouping is revisited in Section VI.

H. Brain Mechanisms Underlying Streaming


During the past decade, there has been a flowering of interest in brain mechanisms underlying auditory streaming. These have involved recordings from neural units in animals, as well as brain scanning in human subjects. As described earlier, when a rapid sequence of tones is continuously presented, and it forms either an ABA_ triplet pattern or an ABAB pattern, the tendency to hear the sequence as two streams builds with repetition, and builds more strongly as the frequency separation between the tones increases and the tempo and duration of the sequence increase. To investigate the basis of these effects, Fishman, Reser,

Arezzo, and Steinschneider (2001) recorded activity from neural units in the primary auditory cortex of awake macaque monkeys to sequences consisting of ABAB tones that differed in frequency. They adjusted the frequency of the A tone so as to elicit the strongest response at the recording site, and they then varied the frequency of the B tone. At slow tempi, the unit showed responses to both the A and B tones. However, at faster tempi, the unit's responses to the B tones weakened as the frequency separation between the A and B tones increased. At large frequency separations and fast tempi, the unit's responses were predominantly to the A tones. By extrapolation, the same process can be assumed to have occurred in units that responded most strongly to the B tones (see also Fishman, Arezzo, & Steinschneider, 2004). Based on these findings, Fishman and colleagues proposed that streaming results from a number of response characteristics of the auditory system: frequency selectivity of individual units, forward suppression across units, and adaptation. Arising from these response characteristics, the A and B tones activate more distinct neuronal populations with increasing frequency separation between the tones, and also with increasing tempo and duration of the sequence, so giving rise to stream segregation. Although this model was originally applied to pure tones of differing frequency, it can also be applied to the streaming of complex tones based on pitch, spectral envelope, spatial location, timbre, and so on, that is, wherever different populations of units can be hypothesized to subserve perception of different attribute values (Shamma & Micheyl, 2010). From another perspective, a number of researchers have investigated the relationship between neural responses to signals that produce streaming and percepts of these signals by human subjects. For example, Micheyl, Tian, Carlyon, and Rauschecker (2005) studied neural responses in the primary auditory cortex of awake rhesus monkeys to tone sequences that would be expected to produce streaming. These responses corresponded well to perceptual changes reported by human subjects when presented with similar stimuli. An even more direct measure of the neural correlates of streaming involves having human subjects make psychophysical streaming judgments while their patterns of brain activity are simultaneously monitored. Cusack (2005) employed sequences of ABA_ triplets whose parameters were in the region in which percepts fluctuated between a single coherent stream and two segregated streams (Van Noorden, 1975). Subjects made judgments of one or two streams while their brain activity was monitored using functional magnetic resonance imaging (fMRI). More activity was found in the posterior intraparietal sulcus during the two-stream than the one-stream percept, even with the stimulus parameters held constant. In a similar vein, Gutschalk, Micheyl, Melcher, Rupp, Scherg, and Oxenham (2005) examined the neural bases of streaming in human subjects, using both behavioral measures and concomitant MEG. Employing sequences of ABA_ triplets, they showed that manipulating the tempo and the frequency difference between the alternating tones resulted in changes in the auditory evoked field; these changes corresponded closely to the degree of streaming reported by the subjects. The authors also created sequences consisting of ABA_ triplets in the region

of ambiguity that produced a bistable percept of either one or two streams. They found that even though the stimulus parameters were held constant, patterns of activation covaried with the subjects' percepts. From the patterns of activation they observed, the authors concluded that streaming most likely arose from nonprimary auditory cortex. Later, Gutschalk, Oxenham, Micheyl, Wilson, and Melcher (2007) presented human subjects with ABBB sequences consisting of harmonic complex tones with varying fundamental frequencies but identical spectral envelopes. As the pitch separation between the alternating tones increased, the subjects were more likely to hear two streams. Along with the development of streaming at the perceptual level, cortical activity as measured by fMRI and MEG increased, both in primary auditory cortex and in surrounding nonprimary areas, with patterns strongly resembling those found for pure tones (see also Wilson, Melcher, Micheyl, Gutschalk, & Oxenham, 2007). Other studies have produced evidence that streaming occurs in regions below the level of the cortex. Using fMRI, Kondo and Kashino (2009) demonstrated the involvement of the medial geniculate body of the thalamus in streaming by human subjects. An even more striking finding was obtained by Pressnitzer, Sayles, Micheyl, and Winter (2008) from single units in the cochlear nucleus of anaesthetized guinea pigs. (The cochlear nucleus receives input from the cochlear nerve, and so is the first way station along the auditory pathway.) The responses from this region were found to be similar to those from the cortex, and displayed all the functional properties that were needed for streaming to occur. Furthermore, perceptual responses obtained from human subjects correlated well with the neurometric responses obtained from the guinea pig cochlear nucleus. As a possible explanation for the neural substrates of streaming at this low level, the responses from the cochlear nucleus could be modulated by centrifugal projections from higher-level structures, including the cortex (Suga & Ma, 2003; Winer, 2006). We now ask whether streaming is mediated by activity in the left hemisphere, or the right, or both. A number of studies exploring the neural correlates of streaming based on pitch have found either no difference in activation between the left and right hemispheres, or activation primarily in the right hemisphere (Cusack, 2005; Gutschalk et al., 2005, 2007; Snyder et al., 2006; Wilson et al., 2007). In contrast, Deike, Gaschler-Markefski, Brechmann, and Scheich (2004) and Deike, Scheich, and Brechmann (2010) found activation primarily in the left hemisphere when subjects were asked to segregate A from B tones continuously in sequences where the tones differed in timbre or in pitch. As suggested by these authors, differences in task requirements may have been responsible for the different patterns of hemispheric activity that were obtained in the various studies.

V. Grouping and Phrase Structure

In the foregoing sections, we have considered ways in which the listener groups sequences of tones into separate parallel streams in a polyphonic texture. We now

turn to the aspect of grouping in music whereby the listener divides sequences of tones into coherent subsequences that are separated by temporal boundaries. It is generally agreed that sequences in Western tonal music are represented by the listener as tonal-temporal hierarchies: notes combine to form motives, which in turn combine to form phrases, which in turn combine to form phrase groups, and so on until the level of the entire piece is reached (Deutsch & Feroe, 1981; Meyer, 1956, 1973; Lerdahl & Jackendoff, 1983; Narmour, 1990, 1999; Salzer, 1962; Schenker, 1956, 1973; Temperley, 2001; Thomson, 1999). The division of the musical surface into hierarchically organized chunks confers enormous processing advantages, as discussed in detail in Chapter 7. Here we discuss the cues whereby such groupings are formed. Tenney and Polansky (1980), in a computational study of grouping in music, stressed the role of temporal proximity, as well as changes in values of other attributes such as pitch and dynamics. Later, Lerdahl and Jackendoff (1983) proposed that grouping boundaries are placed at longer intervals between note onsets (after rests, at the ends of slurs, at long intervals between attacks, or after long notes), and also at changes in values of attributes such as pitch range and dynamics. In an experimental investigation of this issue, Deliège (1987) asked subjects to listen to excerpts of Western classical music, and to mark boundaries between groupings. The perceived boundaries were found to correspond largely to Lerdahl and Jackendoff's proposed grouping cues, with the strongest effects occurring after long notes, followed by changes in dynamics and timbre (see also Clarke & Krumhansl, 1990; Frankland & Cohen, 2004). In general, grouping by temporal proximity has emerged as the most powerful cue for the perception of phrase boundaries. Other work has shown that this cue can, in turn, have a pronounced effect on the perception of pitch patterns. Handel (1973) had subjects identify repeating patterns that consisted of dichotomous elements of differing pitch. Identification performance was high when the patterns were temporally segmented in accordance with pitch structure, but low when temporal segmentation and pitch structure were placed in conflict. Further, Dowling (1973a) presented patterns that consisted of five-tone sequences that were separated by pauses, and subjects made recognition judgments concerning test sequences that were embedded in these patterns. Performance levels were higher when the test sequence was presented in a single temporal segment than when a pause was inserted between its elements. Using more elaborate configurations, Deutsch (1980) presented subjects with sequences of tones, which they recalled in musical notation. The sequences were either hierarchically structured according to the rules of Deutsch and Feroe (1981) (see Chapter 7 for details), or they were composed of the same set of tones but arranged in haphazard fashion. When the tones were presented at equal temporal intervals, performance levels for the structured sequences were very high, whereas they were low for the unstructured sequences. This finding shows that listeners exploit musical knowledge acquired through long-term exposure (in this case, knowledge of the pitch alphabets used in Western tonal music) in order to group notes into phrases.
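The temporal side of these grouping cues is easy to caricature in code. The toy function below is only an illustrative sketch, not Lerdahl and Jackendoff's formalism or Tenney and Polansky's algorithm: it places a boundary after a note whenever the following inter-onset interval is markedly longer than its neighbors, which is one crude way of capturing grouping after rests and long notes. The 1.5 threshold and the example onset times are arbitrary.

from typing import List

def boundaries_from_onsets(onsets: List[float], ratio: float = 1.5) -> List[int]:
    # Return indices i such that a grouping boundary is placed after note i.
    # onsets: note onset times in seconds.
    # ratio:  how much longer an inter-onset interval (IOI) must be than the
    #         mean of its two neighbors to count as a boundary (arbitrary value).
    iois = [b - a for a, b in zip(onsets, onsets[1:])]
    bounds = []
    for i in range(1, len(iois) - 1):
        neighbor_mean = (iois[i - 1] + iois[i + 1]) / 2.0
        if iois[i] > ratio * neighbor_mean:
            bounds.append(i)  # boundary falls after note i
    return bounds

# Hypothetical onset list: two four-note groups separated by a longer gap.
onsets = [0.0, 0.25, 0.5, 0.75, 1.5, 1.75, 2.0, 2.25]
print(boundaries_from_onsets(onsets))  # -> [3], i.e., a boundary after the fourth note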


Returning to the issue of temporal segmentation, Deutsch (1980) found that performance levels were further enhanced when structured sequences were temporally segmented in accordance with their pitch structure. However, when the sequences were segmented in conflict with their pitch structure, the subjects instead formed perceptual groupings based on temporal segmentation. In consequence, they were unable to take advantage of the pitch structure inherent in such sequences, so that performance levels were again low. (See Chapter 7, Figure 14, for the types of temporal segmentation that were employed.) What happens when grouping by pitch proximity and temporal segmentation are set in opposition to each other? Hamaoui and Deutsch (2010) performed a series of experiments to assess the relative strengths of these two cues. Sequences of 12 tones were constructed in which pitch proximity suggested one type of grouping (e.g., four groups of three tones each) and temporal segmentation suggested an opposing type of grouping (in this case, three groups of four tones each). In the default condition, tones were 200 ms in duration and were separated by 100-ms pauses. The tones within a subsequence moved in semitone steps, and the pitch distances employed to suggest grouping by pitch proximity were 2, 5, and 11 semitones. For example, in the sequence shown in the upper part of Figure 12, boundaries between subsequences were marked by distances of 2 semitones, and in the sequences shown in the lower part of the figure, these boundaries were marked by distances of 11 semitones. To suggest grouping by temporal segmentation, the pauses following every third or fourth tone in the sequence were increased by durations varying from 15 to 60 ms. As expected, the larger the pitch distance between groups of tones, the greater the tendency to form groupings based on pitch proximity. However, the temporal cue to grouping was found to be surprisingly powerful, frequently overriding cues provided by large pitch distances when the durations of the pauses were increased by amounts as small as 30 ms. As illustration, the data produced by one subject are shown in Figure 13.

Figure 12 Examples of sequences used to study grouping by temporal segmentation, when this was placed in opposition to grouping by pitch proximity. Here all sequences ascended in semitone steps. (a) Boundaries between subsequences marked by distances of 2 semitones; (b) boundaries between subsequences marked by distances of 11 semitones. From Hamaoui and Deutsch (2010).


In further experiments, Hamaoui and Deutsch (2010) presented subjects with sequences of tones that were hierarchically structured in accordance with the rules of Deutsch and Feroe (1981), together with control sequences that were unstructured but otherwise matched in terms of pitch relationships. The subjects formed groupings based on hierarchical pitch structure, and these groupings were considerably more resistant to the temporal cue than were the matched unstructured sequences; further, groupings that were based both on pitch proximity and hierarchical structure were even more resistant to the temporal cue. The influence of pitch proximity on the formation of coherent patterns was also shown in an experiment by Deutsch (1978). Subjects were asked to judge whether two tones were the same or different in pitch when these were separated by a sequence of intervening tones. Performance levels improved as the average pitch distance between the tones in the intervening sequence was reduced (see Chapter 7, Figure 23). This indicated that when the tones in the intervening sequence were proximal in pitch, they formed a network of pitch relationships to which the test tones were anchored. Statistical evidence that pitch proximity is involved in phrasing was provided by Huron (2006) in a study of musical intervals in roughly 4,600 folk songs. The average interval size within phrases was found to be 2.0 semitones, whereas that between the end of one phrase and the beginning of the next was significantly larger, at 2.9 semitones.
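Huron's comparison is straightforward to reproduce on any phrase-annotated melody. The fragment below uses a made-up two-phrase melody purely for illustration; the 2.0- and 2.9-semitone figures quoted above come from Huron's folk-song corpus, not from this toy example.

import numpy as np

def interval_stats(phrases):
    # phrases: list of phrases, each a list of MIDI note numbers.
    # Returns (mean within-phrase interval, mean across-boundary interval),
    # both in semitones.
    within = [abs(b - a)
              for phrase in phrases
              for a, b in zip(phrase, phrase[1:])]
    across = [abs(nxt[0] - prev[-1])
              for prev, nxt in zip(phrases, phrases[1:])]
    return np.mean(within), np.mean(across)

# Hypothetical melody: two phrases, with a larger leap between them.
phrases = [[60, 62, 64, 65, 64], [69, 67, 65, 64, 62]]
print(interval_stats(phrases))  # within ~1.6 semitones, across = 5.0 for this toy example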

[Plot: percentage of judgments based on pitch (0%–100%) as a function of intertone interval increment (0–60 ms).]

Figure 13 Groupings based on pitch proximity when this cue was placed in conflict with small increments in the pauses between tones. Sequences such as those shown in Figure 12 were presented. In the default condition, tones were 200 ms in duration and separated by 100-ms pauses. Data from one subject are displayed, showing that very small increments in the pauses between tones served as grouping cues, even overriding grouping based on large pitch distances. Solid line indicates grouping by 11 semitone distances; dashed line by 5 semitone distances, and dotted line by 2 semitone distances. From Hamaoui and Deutsch (2010).


A study by Tan, Aiello, and Bever (1981) provided further evidence that knowledge of musical structure contributes to the grouping of tone sequences into phrases. These authors presented melodies consisting of two phrases that were determined by their implied harmonic structure, with no other cues to phrase boundaries. Each melody was then followed by a two-tone probe. It was found that musically trained subjects were better able to identify the probe when it had occurred within a phrase than when it crossed a phrase boundary. Interestingly, grouping of pitch patterns can also occur based on short-term learning of statistical probabilities between adjacent tones, even in the absence of long-term structural cues. This was shown by Saffran, Johnson, Aslin, and Newport (1999), who created words consisting of three-tone patterns, with the tones within words chosen at random from within an octave range. The words were then presented in random orderings, so that there were high transitional probabilities between tones within words, and low transitional probabilities between tones across words. Listeners rapidly learned to group and recognize the words that were formed in this way. Although melodic phrase structure frequently coincides with metrical structure, this does not necessarily occur (Lerdahl & Jackendoff, 1983; Temperley, 2001). As illustration, at the beginning of Chopin's Waltz in D♭ major (Op. 64, No. 1) the melody is composed of the repeating four-note pattern (G-A♭-C-B♭). This pattern is at variance with the metrical structure, so that instead of perceiving a repeating four-note pattern, listeners perceive the two alternating six-note patterns (G-A♭-C-B♭-G-A♭) and (C-B♭-G-A♭-C-B♭), as illustrated in Figure 14. So here grouping by metrical structure overrides grouping by repetition of the pitch pattern. The question also arises as to whether grouping boundaries should be considered to apply to the entire texture of a passage or to different melodic lines separately. Although a rule specifying consistent divisions simplifies many analyses (Lerdahl & Jackendoff, 1983), exceptions can easily be found. Figure 15 shows measures 3–8 of Bach's Italian Concerto. The lower and middle voices alternate between a two-note and a four-note phrase. Yet the phrasing of the higher voice cuts across the boundaries defined by the other voices, so that different groupings are perceived in parallel.
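Returning to the statistical-learning result of Saffran, Johnson, Aslin, and Newport (1999) described above, the structure of their tone streams can be mimicked directly. The sketch below is illustrative only: the three-tone "words" are invented here, and the stream is a random concatenation of them, so that transitions within a word become highly predictable while transitions across word boundaries stay near chance.

import random
from collections import Counter, defaultdict

# Hypothetical three-tone "words" (MIDI note numbers).
WORDS = [(60, 64, 67), (62, 65, 69), (59, 63, 66), (61, 68, 70)]

def make_stream(n_words=300, seed=0):
    # Random concatenation of words; no pauses mark the word boundaries.
    rng = random.Random(seed)
    stream = []
    for _ in range(n_words):
        stream.extend(rng.choice(WORDS))
    return stream

def transition_probabilities(stream):
    # P(next tone | current tone) estimated from bigram counts.
    pair_counts = Counter(zip(stream, stream[1:]))
    first_counts = Counter(stream[:-1])
    probs = defaultdict(dict)
    for (a, b), count in pair_counts.items():
        probs[a][b] = count / first_counts[a]
    return probs

probs = transition_probabilities(make_stream())
# Within-word transitions (e.g., 60 -> 64) approach 1.0; across-word
# transitions (e.g., 67 -> 62) hover near chance (about 0.25 here).
print(probs[60].get(64), probs[67].get(62))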

VI. Grouping of Simultaneous Tone Sequences

When listening to ensemble performances, we are generally presented with multiple sequences of tones that arise in parallel from different regions of space. We can then inquire into the principles that govern the perceptual grouping of such configurations. Do we form parallel linkages between tones that are similar in pitch, in loudness, or in timbre? Do we invoke spatial location as a prominent grouping cue? We shall see that many factors are involved in such grouping, and that they interact in complex ways.

Figure 14 Measures 3–6 of Chopin's Waltz in D♭ major (Op. 64, No. 1). Here grouping by metrical structure overrides grouping by repetition of a pitch pattern. The melody is composed of the repeating four-note pattern [G-A♭-C-B♭]. This pattern is at variance with the metrical structure, so that instead of perceiving a repeating four-note pattern, listeners perceive two alternating six-note pitch patterns [G-A♭-C-B♭-G-A♭] and [C-B♭-G-A♭-C-B♭].

Figure 15 Perceptual grouping in measures 4–7 of the second movement of Bach's Italian Concerto. While the lower and middle voices alternate between a two-note and four-note phrase, the phrasing of the higher voice cuts across the boundaries defined by the other voices, so that different groupings are perceived in parallel.

The problem faced by the auditory system in parsing simultaneous streams of sound that emanate from different regions of space is far more difficult than that faced by the visual system in parsing a visual array. The visual system is presented with a spatial layout of elements at the periphery (with the exception of depth perception). In contrast, space in the auditory system is not mapped directly onto the receptor surface, so the listener is required to infer the spatial origins of sounds by indirect means. Inferred sound location must therefore provide a much less reliable cue for the analysis of auditory scenes than is provided by spatial location for the analysis of visual scenes. In addition, sounds are by their nature fleeting, so that scrutinizing each sound element in series is not feasible for auditory scene analysis. It is not surprising, therefore, that perception of complex music can be quite prone to error, and that powerful illusory conjunctions can occur. When we hear a tone, we attribute to it a pitch, a loudness, and a timbre, and we hear the tone as emanating from a particular spatial location. Each tone, as it is perceived, may then be described as a bundle of attribute values. If our perception is veridical, this bundle reflects the characteristics and location of the emitted sound. However, when multiple sequences of tones are presented simultaneously from different regions of space, these bundles of attribute values may fragment and recombine incorrectly, so that

illusory conjunctions result. These illusory conjunctions can sometimes be due to random error, but they can also reflect the operation of multiple decision mechanisms in the grouping process.

A. The Scale Illusion and Related Phenomena


The scale illusion, which was first devised by Deutsch (1975a, 1975b), results from illusory conjunctions of pitch and location. The pattern that gives rise to this illusion is shown in the upper portion of Figure 16. It consists of a major scale, with successive tones alternating from ear to ear. The scale is played simultaneously in both ascending and descending form, such that whenever a tone from the ascending scale is in the right ear, a tone from the descending scale is in the left ear; and vice versa. The sequence is played repeatedly without pause. When listening to this pattern through earphones, people frequently experience the illusion shown in the lower portion of Figure 16. A melody corresponding to the higher tones is heard as coming from one earphone (in right-handers, this is generally the earphone on the right), while a melody corresponding to the lower tones is heard as coming from the opposite earphone. When the earphone positions are reversed, the apparent locations of the higher and lower tones often remain fixed. This gives rise to the curious impression that the higher tones have migrated from one earphone to the other, and that the lower tones have migrated in the opposite direction. (A minority of listeners instead hear a single melodic line that consists of the higher tones alone, and little or nothing of the lower tones; other listeners obtain yet different illusions, as described in Deutsch, 1975b). In experiencing the scale illusion, then, grouping by pitch proximity is so powerful that not only are the tones organized melodically in accordance with this principle, but they are also frequently reorganized in space in accordance with their melodic reorganization. Such spatial reorganization is in agreement with other
Figure 16 The pattern that produces the scale illusion, and the percept most commonly obtained. When this pattern is played through stereo headphones, most listeners hear two melodic lines that move in contrary motion. The higher tones all appear to be coming from one earphone, and the lower tones from the other, regardless of where each tone is coming from.
findings showing that, in the absence of further supporting cues, differences in ear of input may have only small effects on how components of a tone complex are grouped together (Beerends & Houtsma, 1989; Darwin & Ciocca, 1992), and that other grouping cues can themselves influence the perceived spatial origins of components of a sound complex (Hill & Darwin, 1993). Although in producing the scale illusion the auditory system arrives at conclusions that are wildly wrong, the illusion is based on a listening strategy that is generally conducive to realistic interpretations of our natural environment. It is unlikely that a source in one location is producing a set of tones that leap around in pitch, while another source in a different location is producing a different set of tones that also leap around in pitch. It is far more probable that a sequence of tones in one pitch range has originated from one source, and that another sequence of tones in a different pitch range has originated from a different source. So we exploit pitch proximity as a cue to determine how these tones should be grouped together, and we infer their perceived locations on this basis (Deutsch, 1975a, 1987). Variants of the scale illusion are readily produced. One of these, called the chromatic illusion, is illustrated in Figure 17. A chromatic scale that ranges over two octaves is presented in both ascending and descending form, with the individual tones switching from ear to ear in the same way as in the scale illusion. When the pattern is played in stereo, most listeners hear a higher line that moves down an octave and up again, together with a lower line that moves up an octave and down again, with the two meeting in the middle. Yet when each channel is played separately, the pattern is heard correctly as a series of tones that leap around in pitch. In Figure 17, the smoothing out of the visual representation of the percept reflects well the way the sounds are perceptually reorganized.
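The dichotic patterns behind the scale illusion and its chromatic variant are simple to construct. The sketch below is an illustrative reconstruction rather than Deutsch's original stimulus code: tone duration (250 ms), the C major scale, and equal-tempered tuning are assumed. An ascending and a descending scale sound together, and successive tones of each scale alternate between the left and right channels; substituting a two-octave chromatic scale gives the chromatic illusion pattern.

import numpy as np

SR = 44100
# C major scale, ascending, as equal-tempered frequencies (C4 to C5).
MAJOR_STEPS = [0, 2, 4, 5, 7, 9, 11, 12]
SCALE = [261.63 * 2 ** (s / 12.0) for s in MAJOR_STEPS]

def tone(freq, dur=0.25, sr=SR):
    # Pure tone with short linear ramps (250-ms tones assumed).
    t = np.arange(int(dur * sr)) / sr
    y = np.sin(2 * np.pi * freq * t)
    n = int(0.01 * sr)
    ramp = np.linspace(0, 1, n)
    y[:n] *= ramp
    y[-n:] *= ramp[::-1]
    return y

def scale_illusion():
    # Stereo array (n_samples x 2): ascending and descending scales played
    # together, with successive tones of each scale switching between ears.
    ascending = SCALE
    descending = SCALE[::-1]
    left, right = [], []
    for i, (up, down) in enumerate(zip(ascending, descending)):
        if i % 2 == 0:    # even positions: ascending tone on the right
            right.append(tone(up)); left.append(tone(down))
        else:             # odd positions: ascending tone on the left
            right.append(tone(down)); left.append(tone(up))
    return np.stack([np.concatenate(left), np.concatenate(right)], axis=1)

stereo = scale_illusion()  # play through stereo headphones to hear the illusion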


Figure 17 The pattern that produces a version of the chromatic illusion, and the way it is most often perceived. Adapted from Deutsch (1995).


The cambiata illusion, which was first devised by Deutsch (2003), is based on the same principle. Figure 18 shows the pattern that is presented to the listener via earphones, together with the illusion that is most often obtained. The tones that are presented via each earphone leap around in pitch. However, the percept that is most frequently obtained consists of a higher melody that is composed of three tones that are close in pitch, together with a lower melody that is also composed of three tones that are close in pitch. And again, the higher tones often appear to be emanating from one earphone and the lower tones from the other. Further, as with the scale illusion, there are substantial individual differences in how the cambiata illusion is perceived, with right-handers tending to hear the higher tones as coming from the right earphone and the lower tones from the left one. Butler (1979a) found evidence that the perceptual reorganization that occurs in the scale illusion also occurs in a broad range of musical situations. He presented the scale illusion pattern through spatially separated loudspeakers instead of earphones and asked subjects to notate what they heard. In some conditions, the patterns were composed of piano tones, and differences in timbre were introduced between the sounds coming from the two speakers. Butler found that, despite these variations, virtually all responses reflected grouping by pitch proximity, so that higher and lower melodic lines were perceived, rather than the patterns that were in fact presented. When differences in timbre were introduced between the tones presented from the two speakers, a new tone quality was heard, but it appeared to be coming simultaneously from both speakers.


Figure 18 The pattern that produces a version of the cambiata illusion, and the way it is most often perceived. Adapted from Deutsch (2003).


To determine whether these findings generalize to other configurations, Butler presented listeners with the melodic patterns shown in Figure 19. Again, virtually all responses reflected grouping by pitch range. For both these patterns, a perceptual reorganization occurred, so that a melody corresponding to the higher tones appeared to be coming from one earphone or loudspeaker, and a melody corresponding to the lower tones appeared to be coming from the other one. Such effects even occur on listening to live music in concert halls. There is an interesting passage at the beginning of the final movement of Tchaikovsky's Sixth Symphony (The Pathétique). As shown in Figure 20, the notes from the theme alternate between the first and second violin parts, and the notes from the accompaniment alternate reciprocally (see Butler, 1979b, for a discussion). The passage, however, is not perceived as it is performed; rather, one violin part appears to be playing the theme and the other the accompaniment. This is true even with the orchestra arranged in 19th-century fashion, so that the first violins are to the left of the audience and the second violins to their right. Whether it was Tchaikovsky's intention to produce a spatial illusion here, or whether he expected the audience to hear the theme waft back and forth between the two sides of space, we shall never know. However, there is a legend that the conductor Arthur Nikisch urged Tchaikovsky to rescore this passage so that the first violins would play the entire theme and the second violins the accompaniment. Tchaikovsky refused to change his scoring; however, Nikisch rescored the passage anyway, and so created a second school of performance of this passage. The reasons for the argument between these two great musicians are unknown, but some conductors still prefer to perform the rescored version rather than Tchaikovsky's original one (Carlson, 1996).


Figure 19 Patterns used to study grouping of simultaneous sequences in the experiment of Butler (1979a). Adapted from Butler (1979a).


Figure 20 Beginning of the final movement of Tchaikovsky's Sixth Symphony (The Pathétique). The upper portion of the figure shows the pattern as it is played, and the lower portion shows how it is generally perceived.

Another example of such spatial reorganization occurs at the end of the second movement of Rachmaninoff's Second Suite for Two Pianos. Here the first and second pianos play different patterns, each composed of the same two tones. However, it appears to the listener that one piano is consistently playing the higher tone, and the other piano the lower one (Sloboda, 1985). To return to the experiment of Deutsch (1975b), it is noteworthy that all subjects formed perceptual groupings based on overall pitch range. Rather than following the pattern purely on the basis of local (note-to-note) proximity, they either heard all the tones as two nonoverlapping pitch streams, or they heard the higher tones and little or nothing of the lower ones. No subject reported hearing a full ascending or descending scale as part of the pattern. This shows that the subjects were invoking global pitch range as well as local pitch proximity in making grouping judgments. A related finding was obtained by Van Noorden (1975), who presented an ascending sequence of tones in rapid alternation with a descending one, and subjects heard higher and lower melodic lines that moved in contrary motion. Tougas and Bregman (1985, 1990) observed an analogous perceptual organization of simultaneous ascending and descending glides. However, the perceptual reorganization of tones in space was not explored in these two studies. The perceptual tendency to form melodic streams based on overall pitch range is in line with the rule prohibiting voice crossing in counterpoint, and is reflected in the tendency by composers to avoid part crossing in polyphonic music, an effect documented by Huron (1991a) in an analysis of the polyphonic works of J. S. Bach. Interestingly, although Bach avoided part crossing when writing in two parts,

he avoided it even more assiduously when writing in three or more parts. Huron concluded that Bach was attempting to minimize the perceptual confusion that might otherwise have occurred as the density of sound images increased. Do differences in timbre affect perception of the scale illusion? As described earlier, Butler (1979a) found that moderate differences in timbre did not alter the basic effect. However, Smith, Hausfeld, Power, and Gorta (1982) used tones with substantial timbral differences (one stream was generated by a synthesized piano and another by a synthesized saxophone) and found that timbre was then used as a basis for grouping. In a further experiment, Gregory (1994) generated a number of different instrument tones and used these in various combinations to construct ascending and descending scales. When there was little or no difference in timbre between the scales, listeners perceived higher and lower pitch streams, as described in Deutsch (1975b). However, when substantial differences in timbre were introduced, listeners tended to use these differences as cues for streaming. We can here point out that composers frequently exploit timbre as a carrier of melodic motion (Erickson, 1975), and place different instrument tones in the same pitch range, recognizing that listeners form groupings on the basis of instrument type, as in the Beethoven passage shown in Figure 8. However, the difference in timbre needs to be salient for this device to be effective. A study by Saupe, Koelsch, and Rubsamen (2010) illustrates the difficulty experienced by listeners in judging simultaneous tone sequences on the basis of timbre, when the timbral differences are small and conflict with grouping by pitch proximity. These authors created brief compositions consisting of three melodic parts, each with a different computer-generated timbre (violin, saxophone, and clarinet). The subjects were asked to focus attention on the violin part and to detect falling jumps in this part, ignoring those in the saxophone and clarinet parts. When the three parts emanated from the same spatial location, the error rates in detecting the pitch jumps were extremely high. So far, we have been considering situations in which the tones coming from two sources are simultaneous, and this leads us to inquire what happens when temporal disparities are introduced. As we saw earlier, one would expect listeners to interpret such disparities as indicating that the sounds were originating from different sources, and so to separate them out perceptually. As a result, we would expect streams to be formed here on the basis of spatial location rather than pitch proximity. As a test of this hypothesis, Deutsch (1979) had subjects identify melodic patterns in which tones were distributed across ears in various ways. There were four conditions in the experiment, and these are illustrated in Figure 21. In Condition A, the melody was delivered to both ears simultaneously, and the performance level here was very high. In Condition B, the tones forming each melody were switched haphazardly between ears, and this difference in ear of input caused performance levels to drop considerably. Condition C was exactly as Condition B, except that the melody was accompanied by a drone: Whenever a tone from the melody was delivered to the right ear, the drone was delivered to the left ear, and vice versa. So in this condition, both ears again received input simultaneously, even though

Figure 21 Examples of the ways in which tones were distributed between the two ears in the experiment on binaural integration of melodic patterns. Also shown are the error rates in the different conditions: Condition A, 5.7%; Condition B, 40.1%; Condition C, 16.1%; Condition D, 54.7%. Adapted from Deutsch (1979).

the melody was still switching from ear to ear. The presence of the contralateral drone caused identification performance to return to a high level. In Condition D, the drone again accompanied the melody, but it was now delivered to the same ear as the melody component, so that input was again to one ear at a time. In this condition, performance again dropped substantially. We can conclude that when tones emanate from different spatial locations, temporal relationships between them are important determinants of how they are perceptually grouped together. When tones arrive at both ears simultaneously, they are organized sequentially on the basis of pitch proximity. However, when the tones at the two ears are clearly separated in time, grouping by spatial location is so powerful as to virtually abolish the listener's ability to integrate them into a single melodic stream. A similar conclusion was reached by Judd (1979), who generated two repeating patterns consisting of tones that were presented to the left and right ears in alternation. Subjects listened to pairs of these patterns and judged on each trial whether the members of the pair were the same or different. On half the trials, the tones presented to each ear were separated by silent gaps, and on the other half, the gaps were filled with noise. Judd found that identification performance was enhanced in the presence of the noise, and concluded that the noise degraded the localization information, and so discouraged grouping by spatial location. To return to the study of Deutsch (1979), a second experiment was performed to explore intermediate cases, in which the tones arriving at the two ears were not

strictly synchronous but instead overlapped in time. Specifically, in some conditions the components of the melody and the drone were offset from each other by 15 ms. These intermediate conditions produced intermediate results: Identification of the melody in the presence of the contralateral drone when the two were asynchronous was poorer than when the melody and drone were strictly synchronous, but better than when the tones from the melody switched between the ears without an accompanying drone. It is interesting that Berlioz (1948) came to a similar conclusion from the composer's perspective. In Treatise on Instrumentation, he wrote:
I want to mention the importance of the different points of origin of the tonal masses. Certain groups of an orchestra are selected by the composer to question and answer each other; but this design becomes clear and effective only if the groups which are to carry on the dialogue are placed at a sufficient distance from each other. The composer must therefore indicate on his score their exact disposition. For instance, the drums, bass drums, cymbals, and kettledrums may remain together if they are employed, as usual, to strike certain rhythms simultaneously. But if they execute an interlocutory rhythm, one fragment of which is given to the bass drums and cymbals, the other to kettledrums and drums, the effect would be greatly improved and intensified by placing the two groups of percussion instruments at the opposite ends of the orchestra, that is, at a considerable distance from each other.

Findings from the scale illusion and its variants, together with the drone experiment, indicate that perception of musical passages can indeed be influenced profoundly by the spatial arrangements of instruments. When a pattern of tones is played at a rapid tempo, and the tones comprising the pattern are distributed between different instruments, listeners may be unable to integrate them into a single coherent stream. Such integration is more readily accomplished when the tones played by different instruments overlap in time. However, there is a trade-off: As the amount of temporal overlap increases, our ability to identify the spatial origins of the tones decreases, and when the tones are presented simultaneously, spatial illusions are likely to occur. We now return to the question of how perception of simultaneous patterns of tones may be influenced by whether the higher tones are to the listener's right and the lower tones to the left, or the reverse. As described earlier, when listening to the scale illusion, right-handers tend to hear higher tones on their right and lower tones on their left, regardless of where the tones are coming from. This means that simultaneous tone combinations of the high-right/low-left type tend to be correctly localized, whereas combinations of the high-left/low-right type tend to be localized less correctly. Deutsch (1985) examined this effect in detail. Musically trained subjects were presented with simultaneous sequences of tones, one to each ear, and they transcribed the tones in musical notation. Each ear received a haphazard ordering of the first six tones of a major scale, so that for some chords the tone at the right ear was higher and the tone at the left ear was lower (high-right/low-left chords),

and for other chords this spatial disposition was reversed (high-left/low-right chords). Subjects were asked to notate the tones that were presented to one ear, and to ignore those that were presented to the other ear. When the subjects were attending to the right ear, they notated more higher than lower tones correctly. Furthermore, more higher than lower tones intruded from the left ear into their notations. In contrast, when the subjects were attending to the left ear, they correctly notated virtually the same number of higher and lower tones, with a marginal advantage to the lower tones. Further, more lower than higher tones intruded from the right ear into their notations. In other words, just as in the scale illusion, tones comprising high-right/low-left chords were correctly localized more often than those comprising high-left/low-right chords. In a further experiment, subjects were asked to notate the entire pattern, disregarding ear of input. It was found that they notated more tones correctly when these formed high-right/low-left chords than when they formed high-left/low-right chords. So we can conclude that there is an advantage to high-right/low-left dispositions, both in terms of where the tones appear to be coming from and also how well their pitches are perceived. To the extent that effects of this sort occur in live musical situations, the following line of reasoning may be advanced. In general, contemporary seating arrangements for orchestras are such that, from the performer's point of view, instruments with higher registers are to the right and those with lower registers to the left. As an example, Figure 22 shows a seating plan for the Chicago Symphony, viewed from the back of the stage. Considering the strings, the first violins are

Figure 22 Seating plan for the Chicago Symphony, as viewed from the back of the stage. Adapted from Machlis (1977).
to the right of the second violins, which are to the right of the violas, which are to the right of the cellos, which in turn are to the right of the basses. Consider also the brasses: The trumpets are to the right of the trombones, which are to the right of the tuba. Furthermore, the flutes are to the right of the oboes, and the clarinets to the right of the bassoons. It is interesting that the same principle tends to hold for other musical ensembles also. We may speculate that this type of spatial disposition has evolved by trial and error because it is conducive to optimal performance. However, this presents us with a paradox. Because the audience sits facing the orchestra, this disposition is mirror-image reversed from their point of view: Instruments with higher registers tend to be to the audience's left, and those with lower registers to their right. So for the audience, this spatial arrangement should cause perceptual difficulties. In particular, instruments with low registers that are to the audience's right should be less well perceived and localized. As described in Deutsch (1987), it is unclear how this problem can be resolved so as to produce an optimal seating arrangement for both the performers and the audience. A further illusion in which tones are perceptually reorganized in space was devised by Deutsch (1995), and is called the glissando illusion. The pattern that gives rise to this illusion consists of a synthesized oboe tone of constant pitch, played together with a pure tone whose pitch glides up and down. The listener is seated in front of two stereophonically separated loudspeakers, with one to his left and the other to his right. The signals are repeatedly alternated between the loudspeakers such that when a segment of the oboe tone emanates from one speaker a segment of the glissando emanates from the other one. On listening to this pattern, the oboe tone is heard correctly as switching between loudspeakers; however, the segments of the glissando appear to be joined together seamlessly, so that it appears to be emanating from a source that moves slowly around in space in accordance with its pitch motion. In a large-scale study, Deutsch et al. (2007) found that listeners localized the glissando in a variety of ways. Most right-handers heard the glissando move between left and right, and also between low and high in space, as its pitch moved between low and high; however, nonright-handers were less likely to obtain this percept. Whereas in the scale illusion, most listeners perceive patterns of tones that appear to be coming from fixed spatial locations, the glissando is always perceived as coming from a source that moves slowly through space. In addition, many subjects obtain a percept that results from a synthesis of illusory motion both between left and right and also between low and high. In consequence, the glissando is sometimes heard as tracing an elliptical path between a position low and to the left when its pitch is lowest, and high and to the right when its pitch is highest, as illustrated in Figure 23. We now turn to hypothesized neurological substrates for these illusions. In all cases, there is a strong tendency for right-handers to hear the higher tones as on the right side of space, and the lower tones as on the left side, whereas left-handers and mixed-handers do not show such a strong tendency. Based on findings relating patterns of cerebral dominance to handedness (Isaacs, Barr, Nelson, & Devinsky,

Figure 23 Original drawing by a subject to illustrate his perception of the glissando illusion. The glissando was perceived as tracing an elliptical path through space, from low and to the left when its pitch was lowest, and high and to the right when its pitch was highest. From Deutsch et al. (2007).

2006; Knecht et al., 2000; Luria, 1969; Milner, Branch, & Rasmussen, 1966; Pujol, Deus, Losilla, & Capdevila, 1999), we can conjecture that there is a tendency to perceive higher tones as on the dominant side of space and lower tones as on the nondominant side. This conjecture is supported by other findings indicating that sounds tend to be perceived as on the side of space contralateral to the hemisphere that is more strongly activated (Hari, 1990; Jacobs, Feldman, Diamond, & Bender, 1973; Penfield & Perot, 1963). So we can conclude that these illusory patterns give rise to greater activity in the dominant hemisphere in response to higher tones and to greater activity in the nondominant hemisphere in response to lower tones.

B. The Octave Illusion


In the experiments on simultaneous sequences so far described, grouping by pitch proximity was the rule when both ears received input simultaneously; grouping by spatial location occurred only when temporal disparities were introduced between the tones that were presented to the two ears. The octave illusion, which was discovered by Deutsch (1974), provides an interesting exception, because here following by spatial location occurs even when the tones delivered to the two ears are strictly simultaneous.

Figure 24 Pattern that produces the octave illusion, together with the percept most commonly obtained. Filled boxes indicate tones at 800 Hz and unfilled boxes tones at 400 Hz. When this pattern is played through stereo headphones, most right-handed listeners perceive an intermittent high tone in the right ear that alternates with an intermittent low tone in the left ear. Adapted from Deutsch (1974).

Figure 25 Pattern that produces the octave illusion together with the percept most commonly obtained, shown in musical notation.
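The dichotic pattern, and the percept that the two-channel model described below predicts for it, can be sketched as follows. This is an illustrative reconstruction with assumed 250-ms tone durations, not the original stimulus code; the prediction function simply applies the two rules attributed to the model in the text (pitch is followed from one, typically dominant, ear, and each tone is localized to whichever ear receives the higher frequency).

import numpy as np

SR = 44100
LOW, HIGH = 400.0, 800.0  # the two frequencies, spaced an octave apart

def tone(freq, dur=0.25, sr=SR):
    t = np.arange(int(dur * sr)) / sr
    return np.sin(2 * np.pi * freq * t)

def octave_illusion(n_pairs=10):
    # Stereo pattern: both ears receive a high/low alternation, but out of step,
    # so one ear has 800 Hz whenever the other has 400 Hz.
    left, right = [], []
    for i in range(n_pairs):
        if i % 2 == 0:
            right.append(tone(HIGH)); left.append(tone(LOW))
        else:
            right.append(tone(LOW)); left.append(tone(HIGH))
    return np.stack([np.concatenate(left), np.concatenate(right)], axis=1)

def predicted_percept(n_pairs=10, pitch_ear="right"):
    # Percept under the two-channel model: pitch is taken from the chosen
    # (typically dominant) ear; the tone is localized to whichever ear
    # receives the higher frequency.
    percept = []
    for i in range(n_pairs):
        ears = {"right": HIGH, "left": LOW} if i % 2 == 0 else {"right": LOW, "left": HIGH}
        pitch = ears[pitch_ear]
        location = max(ears, key=ears.get)
        percept.append((pitch, location))
    return percept

# A listener who follows right-ear pitches hears a high tone on the right
# alternating with a low tone on the left:
print(predicted_percept()[:4])
# [(800.0, 'right'), (400.0, 'left'), (800.0, 'right'), (400.0, 'left')]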

The pattern that gives rise to the octave illusion is shown in the upper portions of Figures 24 and 25. Two pure tones spaced an octave apart are repeatedly presented in alternation without pause. The identical sequence is presented to both ears simultaneously; however, the tones are out of step with each other, so that when the right ear receives the high tone the left ear receives the low tone and vice versa. There are strong differences between listeners in how the octave illusion is perceived (Deutsch, 1974, 1975a, 1981, 1983a, 1983b, 1987, 1995, 2004). Most right-handers hear a single tone that switches from ear to ear, while its pitch simultaneously shifts back and forth between high and low. So it appears that the right ear is receiving the pattern high tone-silence-high tone-silence while the left ear is receiving the pattern silence-low tone-silence-low tone. This percept is illustrated in the lower portions of Figures 24 and 25. When the

Figure 26 Model showing how the output of two decision mechanisms, one determining perceived pitch and the other determining perceived location, can combine to produce the octave illusion. Filled boxes indicate tones at 800 Hz, and unfilled boxes indicate tones at 400 Hz. Adapted from Deutsch (1981).

earphone positions are reversed, the apparent locations of the high and low tones often remain fixed: The tone that had appeared in the right ear continues to appear in the right ear, and the tone that had appeared in the left ear continues to appear in the left ear. This produces the bizarre impression that switching headphone positions has caused the high tone to migrate from one earphone to the other, and the low tone to migrate in the opposite direction. Deutsch (1975a) hypothesized that the octave illusion results from the combined operation of two separate decision mechanisms; one determines what pitch we hear, and the other determines where the tone appears to be coming from. The model is depicted in Figure 26. To provide the perceived pitches, the frequencies arriving at one ear are followed, and those arriving at the other ear are suppressed. However, each tone is localized at the ear that receives the higher frequency, regardless of whether a pitch corresponding to the higher or the lower frequency is perceived. We can take a listener who perceives the pitches delivered to his right ear. When the high tone is presented to the right and the low tone to the left, this listener hears a high tone, because it is presented to his right ear. The listener also localizes the tone in his right ear, because this ear is receiving the higher frequency. However, when the low tone is presented to the right ear and the high tone to the left, this listener now hears a low tone, because it is presented to his right ear, but he localizes the tone in the left ear instead, because this ear is receiving the higher frequency. The resultant illusory conjunction of pitch and location results in the percept of a high tone to the right that alternates with a low tone to the left. It can be seen that, on this model, reversing the positions of the earphones would not alter the basic percept. However, for a listener who follows the pitches presented to the left ear instead, holding the localization rule constant, the identical pattern would be heard as a high tone to the left alternating with a low tone to the

right. Later psychophysical experiments have provided further evidence for this model (cf. Deutsch, 1981; Deutsch & Roll, 1976). Since this model was proposed, substantial evidence for separate what and where pathways in the auditory system has been obtained, both in nonhuman primates (cf. Kaas & Hackett, 2000; Rauschecker, 1998; Recanzone, Guard, Phan, & Su, 2000; Tian, Reser, Durham, Kustov, & Rauschecker, 2001) and in human subjects (cf. Ahveninen et al., 2006; Altmann, Bledowski, Wibral, & Kaiser, 2007; Arnott, Binns, Grady, & Alain, 2004; Recanzone & Sutter, 2008). These findings provide clues to the neuroanatomical substrate of the octave illusion in terms of separate what and where decision mechanisms. Recently, Lamminmaki, Hari, and colleagues have provided evidence concerning the neuroanatomical underpinnings of the octave illusion, placing the locus of both the what and where components in the auditory cortex. Lamminmaki and Hari (2000) focused on the where component. Using MEG, they recorded responses to 400-Hz and 800-Hz tones that were presented in different combinations at the two ears. The N100m response (100-ms response from the auditory cortex) at each hemisphere was found to be stronger to tone pairs in which the 800-Hz tone was presented contralaterally and the 400-Hz tone ipsilaterally than when the tone pairs were presented in the opposite configuration. Given that monaural sounds evoke a stronger N100m response in the hemisphere contralateral to the presented sound, and that listeners localize single sounds to the ear contralateral to the hemisphere in which more activation occurs (Hari, 1990), this finding agrees with the proposed lateralization component of the illusion; that is, lateralization of the perceived tone to the ear that receives the higher frequency. Lamminmaki, Mandel, Parkkonen, & Hari (in press) focused on the what component of the illusion, that is, the pattern of pitches that were perceived. Again using MEG, they recorded steady-state responses to all monaural and binaural combinations of 400-Hz and 800-Hz tones, presented as continuous sounds. The subjects were selected for obtaining a typical octave illusion, that is, a high tone in the right ear alternating with a low tone in the left ear. During dichotic presentation of frequencies corresponding to those in the octave illusion, the ipsilateral responses to the left ear tones were weaker, and those to right ear tones were stronger than when both ears received the same tone. Importantly, for the most paradoxical component of the illusion (that is, presentation of the high tone to the left ear and the low tone to the right ear, producing the illusory percept of a low tone in the left ear), responses to the left ear tones were also weaker in the contralateral hemisphere. Taken together, these two sets of findings point to a neuroanatomical instantiation of the octave illusion in terms of separate what and where decision mechanisms. In other work on the neural underpinnings of the octave illusion, Ross, Tervaniemi, and Näätänen (1996) questioned whether the illusion was present at the level of the auditory cortex, or whether it was created higher in the processing stream. These authors presented subjects with the sequence producing the illusion, and intermittently inserted illusion-mimicking sequences of single tones that were presented monaurally. The oddball sequences elicited the mismatch negativity

component of the ERP, which is thought to be generated in the auditory cortex and to reflect perceived changes in sound properties. The authors concluded that the illusion is generated beyond the auditory cortex. However, the sounds as they are perceived in the illusion differ in subtle ways from those mimicking the illusion, and the mismatch negativity could well have picked up on these differences. For example, Sonnadara and Trainor (2005) found that when subjects who heard the illusion as a series of high tones in the right ear alternating with low tones in the left ear were presented with a pattern mimicking the illusion, the mimicking pattern appeared to be louder than the illusion-generating one. This finding is in line with that of Lamminmäki et al. (in press), who also showed that steady-state responses in the auditory cortex to binaurally presented tones were suppressed compared with responses to tones that were presented monaurally. We can note here that the octave illusion has pronounced handedness correlates. Deutsch (1974) found that right-handers tended strongly to hear the high tone on the right and the low tone on the left, and to maintain this percept when the earphone positions were reversed. However, there was considerable variation among left-handers in terms of where the high and low tones appeared to be localized, and what type of illusion was obtained. From further studies, it was concluded that these findings reflected a tendency to perceive the pitches that were presented to the dominant rather than the nondominant ear (Deutsch, 1975a, 1981, 1983a, 1983b, 2004; Deutsch & Roll, 1976). In a further study, Deutsch (1983b) divided the subject population into three groups: right-handed, mixed-handed, and left-handed. The right-handers tended strongly to hear the high tone on the right and the low tone on the left. This tendency was less strong among mixed-handers and even less strong among left-handers. Furthermore, for all three handedness groups, the tendency to perceive the high tone on the right and the low tone on the left was stronger among subjects with only right-handed parents and siblings than among those with a left- or mixed-handed parent or sibling. This pattern of results is in accordance with the literature relating patterns of cerebral dominance to handedness and familial handedness background (Ettlinger, Jackson, & Zangwill, 1956; Isaacs et al., 2006; Luria, 1969; Subirana, 1958), and indicates that in experiencing the octave illusion, listeners generally perceive the high tone on the dominant side of space and the low tone on the nondominant side. A recent finding has produced evidence that perception of the octave illusion may even serve as a reliable indicator of the direction of cerebral dominance in a given individual. Ferrier, Huiskamp, Alpherts, Henthorn, and Deutsch (in preparation) presented the octave illusion to 17 patients who were scheduled to undergo the Wada test to assess their pattern of cerebral dominance, in preparation for brain surgery for the relief of temporal or frontal lobe epilepsy. It was found that all patients heard the high tone on the side contralateral to the hemisphere that was later determined by the Wada test to be dominant for speech. Based on this finding, we conjecture that the octave illusion could be used as a simple, noninvasive, and reliable test for the assessment of the direction of cerebral dominance, a possibility that has considerable clinical potential.


Other work has explored the signal parameters that are necessary to produce the octave illusion. McClurkin and Hall (1981) replaced the 400-Hz pure tone with a high-frequency complex tone with a 200-Hz fundamental. The illusion was found to persist, with the subjects again tending to hear the high-pitched tone in the right ear and the low-pitched tone in the left ear. This finding indicated that pitch rather than frequency region was responsible for the illusory percept most often obtained. Later, Brancucci, Padulo, and Tommasi (2009) observed that the illusion was not confined to tones that were related by exactly an octave, but also occurred with tones that were spaced at intervals that deviated from an octave by one or two semitones; however, the illusion deteriorated as the size of the intervals decreased further. This finding is in accordance with an earlier demonstration by Deutsch (1983a) in which the intervals formed by the two alternating tones were made to vary. In this demonstration, the illusion became degraded as the size of the interval decreased, so that at the minor third an entirely different perceptual impression was produced. Concerning tone duration, while Deutsch (1974, 1983b) used 250-ms tones, Zwicker (1984) found that the illusion sharpened with the use of 200-ms tones; however, Brancucci et al. (2009) reported that the effect was stronger with 500-ms tones. These temporal discrepancies most likely resulted from differences in other signal parameters used for generating the illusion. In a further study, Brannstrom and Nilsson (2011) replaced the 400-Hz and 800-Hz pure tones with narrow-band noises with overlapping spectra, and had subjects make judgments on the pitch and localization components of the illusion separately. Most subjects perceived an illusion in terms of a dominant ear for pitch and lateralization by frequency, as in the two-channel model shown in Figure 26. They also found that the salience of the illusion increased with an increase in the high-frequency content of the noise signal. An interesting study by Brancucci, Lugli, Santucci, and Tommasi (2011) showed that once the octave illusion is induced, its effect can persist strongly. The subjects were presented first with a priming sequence consisting of the octave illusion pattern, and then repeatedly with a test sequence consisting of one of the alternating dichotic chords (either 400 Hz right/800 Hz left, or 800 Hz right/400 Hz left) for up to 6 s. For all the test sequences, the repeating dichotic chords continued to be heard as in the illusion.
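For readers who wish to experiment with these signal parameters directly, the following minimal sketch (in Python with NumPy) generates the basic dichotic sequence discussed above: 250-ms sinusoids at 400 Hz and 800 Hz, with the two frequencies swapping between the ears from tone to tone. The sample rate, the onset/offset ramps, and the number of tones are illustrative assumptions rather than values taken from any of the studies cited.

import numpy as np

SR = 44100  # sample rate in Hz (an assumption; not specified in the studies above)

def tone(freq, dur=0.25, ramp=0.01):
    # Pure tone with raised-cosine onset/offset ramps to avoid audible clicks.
    t = np.arange(int(SR * dur)) / SR
    y = np.sin(2 * np.pi * freq * t)
    n = int(SR * ramp)
    env = np.ones_like(y)
    env[:n] = 0.5 * (1 - np.cos(np.pi * np.arange(n) / n))
    env[-n:] = env[:n][::-1]
    return y * env

def octave_illusion_sequence(n_tones=20, low=400.0, high=800.0):
    # On each 250-ms tone, one ear receives 800 Hz while the other receives
    # 400 Hz; the assignment reverses from tone to tone, with no silent gaps.
    left, right = [], []
    for i in range(n_tones):
        if i % 2 == 0:
            left.append(tone(high)); right.append(tone(low))
        else:
            left.append(tone(low)); right.append(tone(high))
    return np.column_stack([np.concatenate(left), np.concatenate(right)])

stereo = octave_illusion_sequence()  # shape (samples, 2); save as a stereo WAV and play over headphones

Played over headphones, such a sequence should, for most right-handed listeners, yield the percept of a high tone in the right ear alternating with a low tone in the left ear, as described above.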

C. Illusory Conjunctions and Attention


In the stereo illusions we have been describing (the scale illusion and its variants, the glissando illusion, and the octave illusion), illusory conjunctions of pitch and location occur preattentively in most listeners. We can take a listener who clearly hears the octave illusion as a high tone to the right alternating with a low tone to the left (Figures 24 and 25). This listener can focus attention on either the high tone or the low one, or on either the right ear or the left one. When the listener is focusing attention on the low tone that is perceived as being to the left, in reality the low tone is being presented to the right, and the high tone, which is suppressed from perception, is being presented to the left. An illusory conjunction of the low pitch
with the left location therefore occurs despite focused attention on either the low tone or the left ear, so this illusion is not destroyed by attention focusing. A similar argument applies to the scale illusion: The listener who obtains a strong illusion such as that shown in Figure 16 can focus attention on either the higher or the lower tones, and this does not cause the illusion to break down. Similarly, for a listener who obtains a strong glissando illusion, focusing attention on the glissando does not cause it to be heard correctly as leaping between the left and right loudspeakers. Further, in the study of Deutsch (1985), the illusory conjunctions of pitch and location occurred despite explicit instructions to attend to and notate the tones presented to one ear and ignore the other. These findings run counter to the suggestion, proposed for the case of vision, that the conjunction of features requires attentional control and that illusory conjunctions occur when stimuli are outside the focus of attention (Treisman & Gelade, 1980), because here the illusions occur even though the listener is focusing attention on the tone to be identified. Rather, such illusory conjunctions must reflect the outcome of separate decision mechanisms whose outputs combine preattentively so as to lead to erroneous percepts. Evidence for preattentive conjunction of different attribute values has also been found by others in studies using mismatch negativity (Gomes, Bernstein, Ritter, Vaughan, & Miller, 1997; Sussman, Gomes, Manette, Nousak, Ritter, & Vaughan, 1998; Takegata, Huotilainen, Rinne, Näätänen, & Winkler, 2001; Winkler, Czigler, Sussman, Horváth, & Balázs, 2005). A study by Deouell, Deutsch, Scabini, Soroker, and Knight (2008) on two patients with unilateral neglect provides further evidence that illusory conjunctions occur preattentively, and continue to occur when attention is focused on the illusory tones. Unilateral neglect generally occurs with damage to the nondominant hemisphere, and is often accompanied by auditory extinction: the failure to perceive sounds that are presented on one side of space when other sounds are simultaneously presented on the opposite side. The patients were presented with the scale illusion through headphones, and they reported hearing a single stream of tones that smoothly descended and then ascended, as in the scale illusion. However, they also reported hearing all the tones in one ear and silence in the other ear. Since the extinguished tones were being perceived, they must have been erroneously grouped preattentively to one side of space. Other authors have also reported a high prevalence of illusory conjunctions for musical tones. Hall, Pastore, Acker, and Huang (2000) presented subjects with arrays of simultaneous and spatially distributed tones. The subjects were asked to search for specific cued conjunctions of values of pitch and instrument timbre. For example, the target sound could be that of a violin at a fundamental frequency of 509 Hz, followed by an array of simultaneously presented sounds (such as a violin at 262 Hz and a trombone at 509 Hz) that were differentially localized. The subjects judged in separate tasks whether a particular designated feature of timbre or pitch appeared in the array, and whether a combination of two such features appeared.
Although the listeners were well able to identify either the pitch or the timbre alone, they made frequent errors in reporting the presence or absence of target conjunctions, with estimates of illusory conjunction rates ranging from 23% to 40%.


Other research has shown that illusory conjunctions of different attribute values can occur with serial presentation also. Thompson, Hall, and Pressing (2001) presented subjects with a target sequence that was followed by a probe tone. When the probe tone matched one target tone in pitch and a different target tone in duration, on over half the trials the subjects responded that the probe tone matched the same target tone in both pitch and duration.

D. Melody Perception from Phase-Shifted Tones


Another configuration that produces grouping of simultaneous pitch patterns by spatial location was described by Kubovy and colleagues. Kubovy, Cutting, and McGuire (1974) presented a set of simultaneous and continuous pure tones to both ears. They then phase-shifted one of the tones in one ear relative to its counterpart in the opposite ear. When these tones were phase-shifted in sequence, a melody was heard that corresponded to the phase-shifted tones; however, the melody was undetectable when the signal was played to either ear alone. Subjectively, the dichotically presented melody was heard as occurring inside the head but displaced to one side of the midline, while a background hum was heard as localized to the opposite side. So it appeared as though a source in one spatial position was producing the melody, while another source in a different spatial position was producing the background hum. Kubovy (1981) pointed out that there are two potential interpretations of this effect. First, the segregation of the melody from the noise could have been based on concurrent difference cues; that is, the target tone may have been segregated because its interaural disparity (or apparent spatial location) differed from that of the background tones. Alternatively, the effect could have been based on successive difference cues; that is, the target tone may have been segregated because it had shifted its apparent position in space. In further experiments, Kubovy found that both concurrent and successive difference cues were involved in the effect.
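A rough synthesis sketch of this kind of configuration is given below (Python with NumPy). All tones are presented identically to the two ears except that, during each successive time interval, one designated tone receives an interaural phase offset in one ear. The chord frequencies, interval duration, and size of the phase shift are assumptions chosen for illustration, and the abrupt phase steps used here introduce faint monaural transients that the original study's stimuli presumably avoided.

import numpy as np

SR = 44100  # assumed sample rate in Hz

def phase_shift_melody(freqs, melody_order, note_dur=0.3, shift=np.pi / 2):
    # freqs: frequencies (Hz) of the continuous simultaneous pure tones.
    # melody_order: for each successive interval, the index of the tone that
    # receives an extra interaural phase offset in the left ear only.
    n_note = int(SR * note_dur)
    total = n_note * len(melody_order)
    t = np.arange(total) / SR
    left = np.zeros(total)
    right = np.zeros(total)
    for k, f in enumerate(freqs):
        phase = np.zeros(total)  # extra phase applied to the left channel only
        for j, idx in enumerate(melody_order):
            if idx == k:
                phase[j * n_note:(j + 1) * n_note] = shift
        left += np.sin(2 * np.pi * f * t + phase)
        right += np.sin(2 * np.pi * f * t)
    peak = max(np.abs(left).max(), np.abs(right).max())
    return np.column_stack([left, right]) / peak

# A six-tone chord; the "melody" steps through the tones one at a time.
stereo = phase_shift_melody([300, 340, 385, 435, 490, 550], [0, 2, 4, 5, 3, 1])

In this sketch, either channel alone contains essentially the constant chord; the information carrying the melody lies in the interaural phase relations, as in the effect described above.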

VII. Grouping of Equal-Interval Tone Complexes

A. Grouping by Pitch Proximity


Perceptual grouping principles emerge strongly in tone complexes whose components are separated by equal intervals. Octave-related complexes have been explored most extensively (see also Chapter 7). However, tones whose components are related by other intervals have also been explored, as have chords produced by combinations of two or more octave-related complexes. Shepard (1964) generated a series of tones, each of which was composed of 10 components that were separated by octaves. The amplitudes of the components were scaled by a fixed, bell-shaped spectral envelope, such that those in the middle of the musical range were highest and those at the extremes were lowest. Shepard then varied the pitch classes of the tones by shifting all their components up or down in log frequency.
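Shepard's construction can be sketched in a few lines (Python with NumPy). The bell-shaped envelope is implemented here as a Gaussian on log frequency; its center, its width, the lowest component frequency, and the tone duration are illustrative assumptions, since Shepard (1964) specifies only a fixed, bell-shaped spectral envelope spanning the musical range.

import numpy as np

SR = 44100  # assumed sample rate in Hz

def shepard_tone(pitch_class, dur=0.5, n_components=10,
                 f_lowest=27.5, center=np.log2(440.0), width=1.5):
    # One tone: 10 octave-spaced sinusoids whose amplitudes follow a fixed
    # bell-shaped envelope (here a Gaussian over log2 frequency).
    t = np.arange(int(SR * dur)) / SR
    base = f_lowest * 2 ** (pitch_class / 12.0)  # pitch class sets the position within the octave
    y = np.zeros_like(t)
    for k in range(n_components):
        f = base * 2 ** k                         # components separated by octaves
        amp = np.exp(-0.5 * ((np.log2(f) - center) / width) ** 2)
        y += amp * np.sin(2 * np.pi * f * t)
    return y / np.abs(y).max()

# Stepping the pitch class repeatedly clockwise around the circle produces the
# seemingly endless ascent described in the demonstration below.
scale = np.concatenate([shepard_tone(step % 12) for step in range(48)])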



Figure 27 The pitch class circle.

Subjects listened to successive pairs of such tones and judged whether they formed ascending or descending patterns. When the second tone was removed one or two steps clockwise from the first along the pitch class circle (Figure 27), listeners heard an ascending pattern; when the second tone was removed one or two steps counterclockwise, listeners heard a descending pattern instead. As the tones within a pair were separated by larger distances along the pitch class circle, the tendency for judgments to be determined by proximity gradually lessened, and when the tones were separated by exactly a half-octave, ascending and descending judgments occurred equally often. Based on these findings, Shepard produced a compelling demonstration. A series of tones was played that repeatedly traversed the pitch class circle in clockwise steps, so that it appeared to ascend endlessly in pitch: C♯ sounded higher than C, D as higher than C♯, D♯ as higher than D, . . . , A♯ as higher than A, B as higher than A♯, C as higher than B, and so on without end. Counterclockwise motion gave rise to the impression of an endlessly descending series of tones. Risset (1969, 1971) produced a number of striking variants of Shepard's demonstration. In one variant, a single gliding tone was made to traverse the pitch class circle in clockwise direction, so that it appeared to move endlessly upward in pitch. When the tone was made to glide in counterclockwise direction, it appeared to move endlessly downward. In another variant, a tone was made to glide clockwise around the pitch class circle, while the spectral envelope was made to glide downward in log frequency; in consequence, the tone appeared both to ascend and to descend at the same time (see also Charbonneau & Risset, 1973). Effects approaching pitch circularity have been generated by composers for hundreds of years, and can be found in works by Gibbons, Bach, Scarlatti, Haydn, and Beethoven, among others. In the 20th century, effective pitch circularities have been produced by composers such as Stockhausen, Krenek, Berg, Bartók, Ligeti, Tenny, and in particular Risset, using both natural instruments and computer-generated sounds. Braus (1995) provides an extensive discussion of such works. Circular pitches have even been put to effective use in movies. Richard King, the sound designer for the Batman movie The Dark Knight, employed an ever-ascending glide for the sound of Batman's vehicle, the Batpod. In an article for the Los Angeles Times, King wrote, "When played on a keyboard, it gives the illusion of greater and greater speed; the pod appears unstoppable."
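The proximity rule underlying Shepard's pairwise judgments can be stated compactly: the perceived direction of a pair follows the shorter path around the pitch class circle, and becomes ambiguous at the half-octave. A small helper function (Python) makes this explicit; it is purely illustrative.

def circle_step(pc_from, pc_to):
    # Signed distance in semitones around the pitch class circle, in the range
    # -5..+6: positive values mean the shorter path is clockwise (heard as
    # ascending), negative values mean counterclockwise (descending), and +6
    # marks the ambiguous half-octave (tritone) case.
    d = (pc_to - pc_from) % 12
    return d - 12 if d > 6 else d

# C (0) to D (2): +2, heard as ascending; C (0) to A (9): -3, heard as descending.
assert circle_step(0, 2) == 2 and circle_step(0, 9) == -3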



Figure 28 Representation of alternative perceptual organizations in the experiment on grouping of simultaneous pairs of Shepard tones. Subjects grouped the pattern in accordance with harmonic proximity (Percept A) in preference to Percept B. From Deutsch (1988).

Returning to the experimental evidence, the work of Shepard and Risset showed that when other cues to height attribution are weak, listeners invoke proximity in making judgments of relative height for successively presented tones. We can then ask whether the auditory system might invoke proximity in making judgments of relative height for simultaneously presented tones also. In an experiment to examine this issue, Deutsch (1991) presented subjects with patterns that consisted of two simultaneous pairs of Shepard tones. In one pair, the second tone was a semitone clockwise from the first; in the other, it was a semitone counterclockwise. As expected from the earlier work, subjects organized these patterns sequentially in accordance with pitch proximity, so that they heard two melodic lines, one of which ascended by a semitone while the other descended by a semitone. However, as shown in Figure 28, the descending line could in principle be heard as higher and the ascending line as lower, so forming a harmonic grouping in accordance with proximity (Percept A), or the ascending line could be heard as higher and the descending line as lower, so forming a harmonic grouping that ran counter to proximity (Percept B). It was found that all subjects showed a strong tendency to organize the patterns so that they were grouped in accordance with proximity along the harmonic dimension. For example, the pattern in Figure 28 tended to be heard as Percept A rather than Percept B. In all the experiments so far described, the patterns employed were such that proximity along the pitch class circle co-occurred with proximity based on the spectral properties of the tones. The question then arises as to which of these two factors was responsible for the proximity effects that were obtained. This question was addressed by Pollack (1978) with respect to Shepard's original experiment. He presented subjects with complex tones whose components were related by octaves or octave multiples, and found that as the spectral overlap between successively presented tones increased, the tendency to follow by proximity increased also. Pollack concluded that proximity along the spectral dimension was responsible for Shepard's results. A similar conclusion was reached by Burns (1981), who found that the tendency to follow pairs of tones in accordance with spectral proximity was no greater when the tones were composed of octave-related components than when their components were related by other intervals.


Spectral proximity effects have been used to produce other striking illusions. Risset (1986) described an illusion produced by a complex tone whose components were spaced at intervals that were slightly larger than an octave. He played this tone first at one speed and then at twice the speed, so that each component of the first tone had a corresponding component of the second tone with a slightly lower frequency. Listeners heard the second tone as lower than the first, indicating that they were invoking proximity between successive spectral components in making their judgments (see also Risset, 1969, 1971, 1978). A similar finding was reported by Schroeder (1986), who pointed out that this effect is analogous to certain phenomena in fractal geometry. In order to achieve pitch circularity, must the choice of materials be confined to highly artificial tones, or to several instrument tones playing simultaneously? If circular scales could be created from sequences of single tones, each of which comprised a full harmonic series, then the theoretical and practical implications of pitch circularity would be broadened. Benade (1976) pointed out that a good flautist, while playing a sustained note, can vary the relative amplitudes of the odd- and even-numbered harmonics so as to produce a remarkable effect. Suppose he starts out playing note A at F0 = 440 Hz; the listener hears this note as well defined both in pitch class and in octave. Suppose, then, that the performer changes his manner of blowing so that the amplitudes of the odd-numbered harmonics are gradually reduced relative to the even-numbered ones. At some point the listener realizes that he is now hearing the note A an octave higher (that is, corresponding to F0 = 880 Hz), yet this octave transition had occurred without traversing the semitone scale. We can then conjecture that a tone consisting of a full harmonic series might be made to vary continuously between two octaves without traversing the helical path shown in Figure 29. If this were so, then pitch should be represented as a cylinder rather than as a helix, as indicated by the dashed line between D♯ and D♯ in Figure 29.
Figure 29 The helical model of pitch. Musical pitch is shown as varying along both a linear dimension of height and a circular dimension of pitch class. The helix completes one full turn per octave, with the result that tones standing in octave relation are in close spatial proximity. The dashed line from D♯ to D♯ indicates that the pitch of a tone can also be made to vary within the octave along the height dimension without traversing the helix, pointing to a cylindrical rather than helical representation.


Indeed, Patterson, Milroy, and Allerhand (1993) and Warren, Uppenkamp, Patterson, and Griffiths (2003) found that attenuating the odd harmonics of a complex tone relative to the even ones resulted in a perceived increase in the pitch height of the tone. Based on these findings, I reasoned that it should be possible to create pitch circularity from a bank of harmonic complex tones by appropriate manipulations of their odd and even harmonics. One begins with a bank of 12 harmonic complex tones, whose F0s range in semitone steps over an octave. For the tone with the highest F0, the odd and even harmonics are identical in amplitude. Then for the tone a semitone lower, the amplitudes of the odd harmonics are reduced relative to the even ones, so raising the perceived height of this tone. Then for the tone another semitone lower, the amplitudes of the odd harmonics are further reduced relative to the even ones, so raising the perceived height of this tone to a greater extent. One continues this way down the octave in semitone steps, until for the tone with the lowest F0, the odd harmonics no longer contribute to the tone's perceived height. The tone with the lowest F0 is therefore heard as displaced up an octave, and pitch circularity is achieved. After some trial and error, I settled on the following parameters. Complex tones consisting of the first six harmonics were employed, and the amplitudes of the odd harmonics were reduced by 3.5 dB for each semitone step down the scale. When this bank of tones was presented with F0s in ascending semitone steps, listeners heard the sequence as eternally ascending. When the bank was played in descending semitone steps, the sequence was heard as eternally descending instead. Furthermore, when single gliding tones were used instead of steady-state tones, impressions of eternally ascending and descending glides were obtained. In a formal experiment, Deutsch, Dooley, and Henthorn (2008) employed such a bank of 12 tones, and created sequential pairings between each tone and each of the other tones. Listeners were then asked to judge for each tone pair whether it ascended or descended in pitch. When the tones within a pair were separated by a short distance along the pitch class circle, judgments were based almost entirely on proximity. This tendency decreased with increasing distance along the circle, but remained high even at a distance of 5 semitones, almost halfway around the circle. When the data were subjected to multidimensional scaling, strongly circular configurations were obtained. The intriguing possibility then arises that this algorithm could be employed to transform banks of natural instrument tones so that they would also exhibit pitch circularity. William Brent, then a graduate student at the University of California, San Diego, achieved considerable success using bassoon samples, and also some success with oboe, flute, and violin samples, and he has shown that the effect is not destroyed by vibrato. The possibility of creating circular banks of tones derived from natural instruments expands the scope of musical materials available to composers and performers. At the theoretical level, these demonstrations of pitch circularity indicate that pitch should be represented as a solid cylinder rather than as a helix (see also Deutsch, 2010).
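The algorithm just described can be sketched as follows (Python with NumPy). The use of the first six harmonics and the 3.5-dB reduction of the odd harmonics per semitone step follow the text; the starting F0, tone duration, and sample rate are assumptions chosen for illustration.

import numpy as np

SR = 44100  # assumed sample rate in Hz

def circular_tone(steps_down, f0_top=440.0, n_harmonics=6,
                  db_per_step=3.5, dur=0.5):
    # One tone of the circular bank: the first six harmonics of F0, with the
    # odd-numbered harmonics attenuated by 3.5 dB for every semitone step below
    # the tone with the highest F0 (whose odd and even harmonics are equal in amplitude).
    f0 = f0_top * 2 ** (-steps_down / 12.0)
    odd_gain = 10 ** (-db_per_step * steps_down / 20.0)  # dB converted to an amplitude ratio
    t = np.arange(int(SR * dur)) / SR
    y = np.zeros_like(t)
    for h in range(1, n_harmonics + 1):
        amp = odd_gain if h % 2 == 1 else 1.0
        y += amp * np.sin(2 * np.pi * h * f0 * t)
    return y / np.abs(y).max()

# Twelve tones descending by semitones: F0 falls with each step, while the growing
# attenuation of the odd harmonics raises perceived height, closing the circle.
bank = [circular_tone(step) for step in range(12)]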


Figure 30 Representation of the pattern used to obtain an endlessly ascending scale from a sequence of chords. The tones were octave-related complexes, generated under a trapezoidal spectral envelope. A global pitch movement was perceived, reflecting perceptual organization by common fate. Reprinted with permission from Nakajima et al. (1988); data from Teranishi (1982). © 1988 by The Regents of the University of California.

B. Grouping by Common Fate


Returning to sequences composed of octave-related complexes, another perceptual grouping principle has been shown to operate. Teranishi (1982) created a set of major triads that were composed of octave-related complexes generated under a trapezoidal spectral envelope. When a subset of these triads was played in succession as shown in Figure 30, listeners obtained the impression of an endlessly ascending scale. However, as can be seen by perusal of Figure 30, the most proximal relationships between components of successive tones were not uniformly in the ascending direction. For example, taking the first two chords, the descending line G-F♯ follows proximity more closely than the ascending line G-A. Nevertheless, listeners followed global direction in perceiving this chord succession, so that they were basing their relative pitch judgments on an impression of global pitch movement, or common fate. In a follow-up study, Nakajima, Tsumura, Matsuura, Minami, and Teranishi (1988) also examined perception of successions of major triads that were produced by octave-related complexes. Paired comparison judgments involving such triads showed that whereas some subjects displayed a pitch circularity of an octave, others displayed a pitch circularity of roughly 1/3 octave. The authors concluded that the subjects were basing their judgments on the perception of global pitch movement (see also Nakajima, Minami, Tsumura, Kunisaki, Ohnishi, & Teranishi, 1991). In a related study, Allik, Dzhafarov, Houtsma, Ross, and Versfeld (1989) generated random chord sequences that were composed of octave-related complexes. When such chords were juxtaposed in time so that a sufficient number of successive components were related by proximity in the same direction, a global pitch movement in this direction was heard. In general, composers have frequently made use of a perceptual effect of common fate, by creating sequences of chords whose components moved in the same direction and by similar degrees, while the precise intervals between successive tones were varied. An example is given in Figure 31, which shows a passage from Debussy's prelude Le Vent dans la Plaine. Here, the grouping of successive pitches by proximity alone should cause the listener to hear a number of repeating pitches, together with the falling-rising sequence (D♭-C-D♭-C);
however, these percepts are discarded in favor of an impression of a descending series of chords.

Figure 31 A passage from Debussy's prelude Le Vent dans la Plaine. The listener perceives this passage globally as a downward pitch movement, in accordance with the principle of common fate. (The passage is plotted as log frequency against time.)

VIII. Relationships to Music Theory and Practice


In this chapter, we have explored a number of findings that elucidate the way our auditory system groups the components of music into perceptual configurations. Beyond their interest to psychologists, these findings have implications for music theory and practice. In treatises on music theory, we encounter a number of rules that instruct the student in the art of composition. Among these are the law of stepwise progression, which states that melodic progression should be by steps (i.e., a half step or a whole step) rather than by skips (i.e., more than a whole step) because stepwise progression is considered to be in some way stronger or more binding. Another law prohibits the crossing of voices in counterpoint. What is left unspecified is why these precepts should be obeyed: It is assumed that the reader will either follow them uncritically or recognize their validity by introspection. The findings that we have been reviewing provide such laws with rational bases by demonstrating the


perceptual effects that occur when they are violated. This in turn enables musicians to make more informed compositional decisions. As a related point, with the advent of computer music, the composer is no longer bound by the constraints of natural instruments, but is instead faced with an infinity of compositional possibilities. As a result, it has become critically important to understand certain basic perceptual phenomena, such as the factors that lead us to fuse together the components of a spectrum so as to obtain a unitary sound image, and the factors that lead us to separate out components so as to obtain multiple sound images. Such knowledge is a necessary first step in the creation of new musical timbres. For similar reasons, we need to understand the principles by which we form simultaneous and successive linkages between different sounds, so that listeners will perceive musical patterns as intended by the composer. Finally, the illusions we have been exploring show that listeners do not necessarily perceive music in accordance with the written score, or as might be imagined from reading a score. Musical rules that have evolved through centuries of practical experience provide some ways of protecting the composer from generating music that could be seriously misperceived. However, with our new compositional freedom, there has emerged a particularly strong need to understand how music as it is notated and performed maps onto music as it is perceived. The findings reviewed here have brought us closer to realizing this goal, although much more remains to be learned.

Acknowledgments
I am grateful to Trevor Henthorn for help with the illustrations, and to Frank Coffaro for help with formatting the references. Preparation of this chapter was supported in part by an Interdisciplinary Research Award to the author from the University of California, San Diego.

References
Ahveninen, J., Jääskeläinen, I. P., Raij, T., Bonmassar, G., Devore, S., & Hämäläinen, M., et al. (2006). Task-modulated "what" and "where" pathways in human auditory cortex. Proceedings of the National Academy of Sciences, 103, 14608–14613. Allik, J., Dzhafarov, E. N., Houtsma, A. J. M., Ross, J., & Versfeld, N. J. (1989). Pitch motion with random chord sequences. Perception & Psychophysics, 46, 513–527. Altmann, C. F., Bledowski, C., Wibral, M., & Kaiser, J. (2007). Processing of location and pattern changes of natural sounds in the human auditory cortex. NeuroImage, 35, 1192–1200. Anstis, S. M., & Saida, S. (1985). Adaptation to auditory streaming of frequency-modulated tones. Journal of Experimental Psychology: Human Perception and Performance, 11, 257–271. Arnott, S. R., Binns, M. A., Grady, C. L., & Alain, C. (2004). Assessing the auditory dual-pathway model in humans. NeuroImage, 22, 401–408.


Assmann, P. F., & Summerfeld, A. Q. (1990). Modeling the perception of concurrent vowels: Vowels with different fundamental frequencies. Journal of the Acoustical Society of America, 88, 680697. Beauvois, M. W., & Meddis, R. (1997). Time decay of auditory stream biasing. Perception & Psychophysics, 59, 8186. Beerends, J. G., & Houtsma, A. J. M. (1989). Pitch identification of simultaneous dichotic two-tone complexes. Journal of the Acoustical Society of America, 85, 813819. Benade, A. H. (1976). Fundamentals of musical acoustics. Oxford, UK: Oxford University Press. Berlioz, H. (1948). In R. Strauss (Ed.), Treatise on instrumentation. New York, NY: Kalmus. Bey, C., & McAdams, S. (2003). Postrecognition of interleaved melodies as an indirect measure of auditory stream formation. Journal of Experimental Psychology: Human Perception and Performance, 29, 267279. Braaten, R. F., & Leary, J. C. (1999). Temporal induction of missing birdsong segments in European starlings. Psychological Science, 10, 162166. Brancucci, A., Lugli, V., Santucci, A., & Tommasi, L. (2011). Ear and pitch segregation in Deutschs octave illusion persist following switch from stimulus alternation to repetition. Journal of the Acoustical Society of America, 130, 21792185. Brancucci, A., Padulo, C., & Tommasi, L. (2009). Octave illusion or Deutschs illusion? Psychological Research, 73, 303307. Brannstrom, K. J., & Nilsson, P. (2011). Octave illusion elicited by overlapping narrowband noises. Journal of the Acoustical Society of America, 129, 32133220. Braus, I. (1995). Retracing ones steps: An overview of pitch circularity and Shepard tones in European music, 15501990. Music Perception, 12, 323351. Bregman, A. S. (1978). The formation of auditory streams. In J. Requin (Ed.), Attention and performance (Vol. VII, pp. 6376). Hillsdale, NJ: Erlbaum. Bregman, A. S. (1990). Auditory scene analysis: The perceptual organization of sound. Cambridge, MA: MIT Press. Bregman, A. S., Abramson, J., Doehring, P., & Darwin, C. J. (1985). Spectral integration based on common amplitude modulation. Perception & Psychophysics, 37, 483493. Bregman, A. S., & Campbell, J. (1971). Primary auditory stream segregation and perception of order in rapid sequences of tones. Journal of Experimental Psychology, 89, 244249. Bregman, A. S., & Dannenbring, G. L. (1977). Auditory continuity and amplitude edges. Canadian Journal of Psychology, 31, 151159. Bregman, A. S., Liao, C., & Levitan, R. (1990). Auditory grouping based on fundamental frequency and formant peak frequency. Canadian Journal of Psychology, 44, 400413. Bregman, A. S., & Pinker, S. (1978). Auditory streaming and the building of timbre. Canadian Journal of Psychology, 32, 2031. Broadbent, D. E., & Ladefoged, P. (1957). On the fusion of sounds reaching different sense organs. Journal of the Acoustical Society of America, 29, 708710. Burns, E. (1981). Circularity in relative pitch judgments for inharmonic complex tones: the Shepard demonstration revisited, again. Perception & Psychophysics, 30, 467472. Butler, D. (1979a). A further study of melodic channeling. Perception & Psychophysics, 25, 264268. Butler, D. (1979b). Melodic channeling in a musical environment. Paper presented at the Research Symposium on the Psychology and Acoustics of Music, Kansas. Carlson, S. (1996). Dissecting the brain with sound. Scientific American, 275, 112115. Carlyon, R. P. (1991). Discriminating between coherent and incoherent frequency modulation of complex tones. 
Journal of the Acoustical Society of America, 89, 329340.


Carlyon, R. P. (1992). The psychophysics of concurrent sound segregation. Philosophical Transactions of the Royal Society of London, Series B, 336, 347355. Carlyon, R. P. (2004). How the brain separates sounds. Trends in Cognitive Sciences, 8, 465471. Carlyon, R. P., Cusack, R., Foxton, J. M., & Robertson, I. H. (2001). Effects of attention and unilateral neglect on auditory stream segregation. Journal of Experimental Psychology: Human Perception and Performance, 27, 115127. Carlyon, R. P., & Gockel, H. (2007). Effects of harmonicity and regularity on the perception of sound sources. In W. A. Yost, A. N. Popper, & R. R. Fay (Eds.), Auditory perception of sound sources (pp. 191213). New York, NY: Springer. Carlyon, R. P., Plack, C. J., Fantini, D. A., & Cusack, R. (2003). Cross-modal and nonsensory influences on auditory streaming. Perception, 32, 13931402. de jugements de hauteur sonore. Charbonneau, G., & Risset, J. C. (1973). Circularite Comptes Rendus de lAcademie des Sciences, Serie B, 277, 623. Chowning, J. M. (1980). Computer synthesis of the singing voice. In J. Sundberg (Ed.), Sound generation in winds, strings, and computers (pp. 413). Stockholm, Sweden: Royal Swedish Academy of Music. Ciocca, V., & Bregman, A. S. (1987). Perceived continuity of gliding and steady-state tones through interrupting noise. Perception & Psychophysics, 42, 476484. Ciocca, V., & Darwin, C. J. (1999). The integration of nonsimultaneous frequency components into a single virtual pitch. Journal of the Acoustical Society of America, 105, 24212430. Clarke, E. F., & Krumhansl, C. L. (1990). Perceiving musical time. Music Perception, 7, 213251. Cusack, R. (2005). The intraparietal sulcus and perceptual organization. Journal of Cognitive Neuroscience, 17, 641651. Cusack, R., Deeks, J., Aikman, G., & Carlyon, R. P. (2004). Effects of location, frequency region, and time course of selective attention on auditory stream analysis. Journal of Experimental Psychology: Human Perception and Performance, 30, 643656. Cusack, R., & Roberts, B. (2000). Effects of differences in timbre on sequential grouping. Perception & Psychophysics, 62, 11121120. Cusack, R., & Roberts, B. (2004). Effects of differences in the pattern of amplitude envelopes across harmonics on auditory stream segregation. Hearing Research, 193, 95104. Dannenbring, G. L. (1976). Perceived auditory continuity with alternately rising and falling frequency transitions. Canadian Journal of Psychology, 30, 99114. Dannenbring, G. L., & Bregman, A. S. (1976). Stream segregation and the illusion of overlap. Journal of Experimental Psychology: Human Perception and Performance, 2, 544555. Darwin, C. J. (1981). Perceptual grouping of speech components differing in fundamental frequency and onset-time. Quarterly Journal of Experimental Psychology, 33A, 185207. Darwin, C. J. (1984). Perceiving vowels in the presence of another sound: constraints on formant perception. Journal of the Acoustical Society of America, 76, 16361647. Darwin, C. J. (2005a). Pitch and auditory grouping. In C. J. Plack, A. J. Oxenham, R. R. Fay, & A. N. Popper (Eds.), Springer handbook of auditory research: Pitch neural coding and perception (pp. 278305). New York: Springer. Darwin, C. J. (2005b). Simultaneous grouping and auditory continuity. Perception & Psychophysics, 67, 13841390. Darwin, C. J., & Carlyon, R. P. (1995). Auditory grouping. In B. C. J. Moore (Ed.), Hearing (pp. 387424). San Diego, CA: Academic Press.


Darwin, C. J., & Ciocca, V. (1992). Grouping in pitch perception: Effects of onset asynchrony and ear of presentation of a mistuned component. Journal of the Acoustical Society of America, 91, 33813390. Darwin, C. J., Ciocca, V., & Sandell, G. R. (1994). Effects of frequency and amplitude modulation on the pitch of a complex tone with a mistuned harmonic. Journal of the Acoustical Society of America, 95, 26312636. Deike, S., Gaschler-Markefski, B., Brechmann, A., & Scheich, H. (2004). Auditory stream segregation relying on timbre involves left auditory cortex. Neuroreport, 15, 15111514. Deike, S., Scheich, H., & Brechmann, A. (2010). Active stream segregation specifically involves the left human auditory cortex. Hearing Research, 265, 3037. ` ge, I. (1987). Grouping conditions in listening to music: an approach to Lerdahl & Delie Jackendoffs grouping preference rules. Music Perception, 4, 325360. Demany, L. (1982). Auditory stream segregation in infancy. Infant Behavior & Development, 5, 261276. Deouell, L. Y., Deutsch, D., Scabini, D., Soroker, N., & Knight, R. T. (2008). No disillusions in auditory extinction: perceiving a melody comprised of unperceived notes. Frontiers of Human Neuroscience, 1, 16. Deutsch, D. (1974). An auditory illusion. Nature, 251, 307309. Deutsch, D. (1975a). Musical illusions. Scientific American, 233, 92104. Deutsch, D. (1975b). Two-channel listening to musical scales. Journal of the Acoustical Society of America, 57, 11561160. Deutsch, D. (1978). Delayed pitch comparisons and the principle of proximity. Perception & Psychophysics, 23, 227230. Deutsch, D. (1979). Binaural integration of melodic patterns. Perception & Psychophysics, 25, 399405. Deutsch, D. (1980). The processing of structured and unstructured tonal sequences. Perception & Psychophysics, 28, 381389. Deutsch, D. (1981). The octave illusion and auditory perceptual integration. In J. V. Tobias, & E. D. Schubert (Eds.), Hearing research and theory (Vol. I, pp. 99142). New York, NY: Academic Press. Deutsch, D. (1983a). Auditory illusions, handedness, and the spatial environment. Journal of the Audio Engineering Society, 31, 607620. Deutsch, D. (1983b). The octave illusion in relation to handedness and familial handedness background. Neuropsychologia, 21, 289293. Deutsch, D. (1985). Dichotic listening to melodic patterns, and its relationship to hemispheric specialization of function. Music Perception, 3, 128. Deutsch, D. (1987). Illusions for stereo headphones. Audio Magazine, 71, 3648. Deutsch, D. (1988). Lateralization and sequential relationships in the octave illusion. Journal of the Acoustical Society of America, 83, 365368. Deutsch, D. (1991). Pitch proximity in the grouping of simultaneous tones. Music Perception, 9, 185198. Deutsch, D. (1995). Musical illusions and paradoxes [CD]. La Jolla, CA: Philomel Records. Deutsch, D. (1996). The perception of auditory patterns. In W. Prinz, & B. Bridgeman (Eds.), Handbook of perception and action (Vol. 1, pp. 253296). San Diego, CA: Academic Press. Deutsch, D. (2003). Phantom words, and other curiosities [CD]. La Jolla, CA: Philomel Records. Deutsch, D. (2004). The octave illusion revisited again. Journal of Experimental Psychology: Human Perception and Performance, 30, 355364.


Deutsch, D. (2010). The paradox of pitch circularity. Acoustics Today, July Issue, 815. Deutsch, D., Dooley, K., & Henthorn, T. (2008). Pitch circularity from tones comprising full harmonic series. Journal of the Acoustical Society of America, 124, 589597. Deutsch, D., & Feroe, J. (1981). The internal representation of pitch sequences in tonal music. Psychological Review, 88, 503522. Deutsch, D., Hamaoui, K., & Henthorn, T. (2007). The glissando illusion and handedness. Neuropsychologia, 45, 29812988. Deutsch, D., & Roll, P. L. (1976). Separate what and where decision mechanisms in processing a dichotic tonal sequence. Journal of Experimental Psychology: Human Perception and Performance, 2, 2329. Deutsch, J. A., & Deutsch, D. (1963). Attention: some theoretical considerations. Psychological Review, 70, 8090. Dowling, W. J. (1973a). Rhythmic groups and subjective chunks in memory for melodies. Perception & Psychophysics, 4, 3740. Dowling, W. J. (1973b). The perception of interleaved melodies. Cognitive Psychology, 5, 322337. Dowling, W. J., Lung, K. M., & Herrbold, S. (1987). Aiming attention in pitch and time in the perception of interleaved melodies. Perception & Psychophysics, 41, 642656. Elhalali, M., Xiang, J., Shamma, S. A., & Simon, J. Z. (2009). Interaction between attention and bottom-up saliency mediates the representation of foreground and background in an auditory scene. Public Library of Science: Biology, 7, 114. Erickson, R. (1975). Sound structure in music. Berkeley, CA: University of California Press. Ettlinger, G., Jackson, C. V., & Zangwill, O. L. (1956). Cerebral dominance in sinistrals. Brain, 79, 569588. Fay, R. R. (1998). Auditory stream segregation in goldfish (Carassius auratus). Hearing Research, 120, 6976. Ferrier, C. H., Huiskamp, G .J. M., Alpherts, W. C. P., Henthorn, T., & Deutsch, D. (in preparation). The octave illusion: A noninvasive tool for presurgical assessment of language lateralization. Fishman, Y. I., Arezzo, J. C., & Steinschneider, M. (2004). Auditory stream segregation in monkey auditory cortex: effects of frequency separation, presentation rate, and tone duration. Journal of the Acoustical Society of America, 116, 16561670. Fishman, Y. I., Reser, D. H., Arezzo, J. C., & Steinschneider, M. (2001). Neural correlates of auditory stream segregation in primary auditory cortex of the awake monkey. Hearing Research, 151, 167187. Frankland, B. W., & Cohen, A. J. (2004). Parsing of melody: quantification and testing of the local grouping rules of Lerdahl and Jackendoffs A Generative Theory of Tonal Music. Music Perception, 21, 499543. Gardner, R. B., Gaskill, S. A., & Darwin, C. J. (1989). Perceptual grouping of formants with static and dynamic differences in fundamental frequency. Journal of the Acoustical Society of America, 85, 13291337. Gomes, H., Bernstein, R., Ritter, W., Vaughan, H. G., & Miller, J. (1997). Storage of feature conjunctions in transient auditory memory. Psychophysiology, 34, 712716. Gregory, A. H. (1994). Timbre and auditory streaming. Music Perception, 12, 161174. Grimault, N., Bacon, S. P., & Micheyl, C. (2002). Auditory stream segregation on the basis amplitude-modulation rate. Journal of the Acoustical Society of America, 111, 13401348. Grimault, N., Micheyl, C., Carlyon, R. P., Arthaud, P., & Collett, L. (2000). Influence of peripheral resolvability on the perceptual segregation of harmonic complex tones differing in fundamental frequency. Journal of the Acoustical Society of America, 108, 263271.


Gutschalk, A., Micheyl, C., Melcher, J. R., Rupp, A., Scherg, M., & Oxenham, A. J. (2005). Neuromagnetic correlates of streaming in human auditory cortex. Journal of Neuroscience, 25, 53825388. Gutschalk, A., Oxenham, A. J., Micheyl, C., Wilson, E. C., & Melcher, J. R. (2007). Human cortical activity during streaming without spectral cues suggest a general neural substrate for auditory stream segregation. Journal of Neuroscience, 27, 1307413081. Hall, M. D., Pastore, R. E., Acker, B. E., & Huang, W. (2000). Evidence for auditory feature integration with spatially distributed items. Perception & Psychophysics, 62, 12431257. Hamaoui, K., & Deutsch, D. (2010). The perceptual grouping of musical sequences: Pitch and timing as competing cues. In S. M. Demorest, S. J. Morrison, & P. S. Campbell (Eds.), Proceedings of the 11th International Conference on Music Perception and Cognition, Seattle, Washington (pp. 8187). Handel, S. (1973). Temporal segmentation of repeating auditory patterns. Journal of Experimental Psychology, 101, 4654. Hari, R. (1990). The neuromagnetic method in the study of the human auditory cortex. In F. Grandori, M. Hoke, & G. L. Romani (Eds.), Auditory evoked magnetic fields and electric potentials: Advances in audiology (pp. 222282). Basel, Switzerland: S. Karger. Hartmann, W. M., & Goupell, M. J. (2006). Enhancing and unmasking the harmonics of a complex tone. Journal of the Acoustical Society of America, 120, 21422157. Heise, G. A., & Miller, G. A. (1951). An experimental study of auditory patterns. American Journal of Psychology, 64, 6877. Helmholtz, H. von (1925). Helmholtzs physiological optics (Translated from the 3rd German ed., 19091911 by J. P. C. Southall, Ed.). Rochester, NY: Optical Society of America. Helmholtz, H. von (1954). On the sensations of tone as a physiological basis for the theory of music (2nd English ed.). New York, NY: Dover. Hill, N. J., & Darwin, C. J. (1993). Effects of onset asynchrony and of mistuning on the lateralization of a pure tone embedded in a harmonic complex. Journal of the Acoustical Society of America, 93, 23072308. Houtgast, T. (1972). Psychophysical evidence for lateral inhibition in hearing. Journal of the Acoustical Society of America, 51, 18851894. Houtsma, A. J. M., Rossing, T. D., & Wagenaars, W. M. (1987). Auditory demonstrations. Eindhoven, The Netherlands, and the Acoustical Society of America. Hukin, R. W., & Darwin, C. J. (1995a). Comparison of the effect of onset asynchrony on auditory grouping in pitch matching and vowel identification. Perception & Psychophysics, 57, 191196. Hukin, R. W., & Darwin, C. J. (1995b). Effects of contralateral presentation and of interaural time differences in segregating a harmonic from a vowel. Journal of the Acoustical Society of America, 98, 13801386. Huron, D. (1991a). The avoidance of part-crossing in polyphonic music: perceptual evidence and musical practice. Music Perception, 9, 93104. Huron, D. (1991b). Tonal consonance versus tonal fusion in polyphonic sonorities. Music Perception, 9, 135154. Huron, D. (1993). Note-onset asynchrony in J. S. Bachs two-part inventions. Music Perception, 10, 435444. Huron, D. (2001). Tone and voice: A derivation of the rules of voice-leading from perceptual principles. Music Perception, 19, 164. Huron, D. (2006). Sweet anticipation: Music and the psychology of expectation. Cambridge, MA: MIT Press.


Isaacs, K. L., Barr, W. B., Nelson, P. K., & Devinsky, O. (2006). Degree of handedness and cerebral dominance. Neurology, 66, 18551858. Iverson, P. (1995). Auditory stream segregation by musical timbre: effects of static and dynamic acoustic attributes. Journal of Experimental Psychology: Human Perception and Performance, 21, 751763. Izumi, A. (2002). Auditory stream segregation in Japanese monkeys. Cognition, 82, B113B122. Jacobs, L., Feldman, M., Diamond, S. P., & Bender, M. B. (1973). Palinacousis: persistent or recurring auditory sensations. Cortex, 9, 275287. Judd, T. (1979). Comments on Deutschs musical scale illusion. Perception & Psychophysics, 26, 8592. Kaas, J. H., & Hackett, T. A. (2000). Subdivisions of auditory cortex and processing streams in primates. Proceedings of the National Academy of Sciences USA, 97, 1179311799. Knecht, S., Drager, B., Deppe, M., Bobe, L., Lohmann, H., & Floel, A., et al. (2000). Handedness and hemispheric language dominance in healthy humans. Brain, 123, 25122518. Kondo, H. M., & Kashino, M. (2009). Involvement of the thalamocortical loop in the spontaneous switching of percepts in auditory streaming. Journal of Neuroscience, 29, 1269512701. Kubovy, M. (1981). Concurrent pitch segregation and the theory of indispensable attributes. In M. Kubovy, & J. Pomerantz (Eds.), Perceptual organization (pp. 5598). Hillsdale, NJ: Erlbaum. Kubovy, M., Cutting, J. E., & McGuire, R. M. (1974). Hearing with the third ear: dichotic perception of a melody without monaural familiarity cues. Science, 186, 272274. Lamminmaki, S., & Hari, R. (2000). Auditory cortex activation associated with octave illusion. Neuroreport, 11, 14691472. Lamminmaki, S., Mandel, A., Parkkonen, L., & Hari, R. (in press). Binaural interaction and pitch perception as contributors to the octave illusion. Journal of the Acoustical Society of America. Lerdahl, F., & Jackendoff, R. (1983). A generative theory of tonal music. Cambridge, MA: MIT Press. Luria, A. B. (1969). Traumatic aphasia. The Hague, The Netherlands: Mouton. MacDougall-Shackleton, S. A., Hulse, S. H., Gentner, T. Q., & White, W. (1998). Auditory scene analysis by European starlings (Sturnus vulgaris): perceptual segregation of tone sequences. Journal of the Acoustical Society of America, 103, 35813587. Machlis, J. (1977). The enjoyment of music (4th ed.). New York, NY: Norton. Marin, C. M. H., & McAdams, S. (1991). Segregation of concurrent sounds: II. effects of spectral envelope tracing, frequency modulation coherence, and frequency modulation width. Journal of the Acoustical Society of America, 89, 341351. McAdams, S. (1984). The auditory image: A metaphor for musical and psychological research on auditory organization. In W. R. Crozier, & A. J. Chapman (Eds.), Cognitive processes in the perception of art (pp. 298324). Amsterdam, The Netherlands: North-Holland. McAdams, S. (1989). Segregation of concurrent sounds: I. effects of frequency modulation coherence. Journal of the Acoustical Society of America, 86, 21482159. McAdams, S., & Bertoncini, J. (1997). Organization and discrimination of repeating sound sequences by newborn infants. Journal of the Acoustical Society of America, 102, 29452953. McClurkin, H., & Hall, J. W. (1981). Pitch and timbre in a two-tone dichotic auditory illusion. Journal of the Acoustical Society of America, 69, 592594. McDermott, J. H., & Oxenham, A. J. (2008). Spectral completion of partially masked sounds. Proceedings of the National Academy of Sciences USA, 105, 59395955.


McDermott, J. H., Wrobleski, D., & Oxenham, A. J. (2011). Recovering sound sources from embedded repetition. Proceedings of the National Academy of Sciences USA, 108, 11881193. McNabb, M. M. (1981). Dreamsong: The composition. Computer Music Journal, 5, 3653. Meyer, L. B. (1956). Emotion and meaning in music. Chicago, IL: University of Chicago Press. Meyer, L. B. (1973). Explaining music: Essays and explorations. Berkeley, CA: University of California Press. Micheyl, C., Tian, B., Carlyon, R. P., & Rauschecker, J. P. (2005). Perceptual organization of tone sequences in the auditory cortex of awake macaques. Neuron, 48, 139148. Miller, G. A., & Heise, G. A. (1950). The trill threshold. Journal of the Acoustical Society of America, 22, 637638. Miller, G. A., & Licklider, J. C. R. (1950). The intelligibility of interrupted speech. Journal of the Acoustical Society of America, 22, 167173. Milner, B., Branch, C., & Rasmussen, T. (1966). Evidence for bilateral speech representation in some nonrighthanders. Transactions of the American Neurological Association, 91, 306308. Moore, B. C. J., Glasberg, B. R., & Peters, R. W. (1985). Relative dominance of individual partials in determining the pitch of complex tones. Journal of the Acoustical Society of America, 77, 18531860. Moore, B. C. J., Glasberg, B. R., & Peters, R. W. (1986). Thresholds for hearing mistuned partials as separate tones in harmonic complexes. Journal of the Acoustical Society of America, 80, 479483. Narmour, E. (1990). The analysis and cognition of basic melodic structures: The implicationrealization model. Chicago, IL: University of Chicago Press. Narmour, E. (1999). Hierarchical expectation and musical style. In D. Deutsch (Ed.), The psychology of music (2nd ed., pp. 441472). San Diego, CA: Academic Press. Nakajima, Y., Minami, H., Tsumura, T., Kunisaki, H., Ohnishi, S., & Teranishi, R. (1991). Dynamic pitch perception for complex tones of periodic spectral patterns. Music Perception, 8, 291314. Nakajima, Y., Tsumura, T., Matsuura, S., Minami, H., & Teranishi, R. (1988). Dynamic pitch perception for complex tones derived from major triads. Music Perception, 6, 120. Patterson, R. D., Milroy, R., & Allerhand, M. (1993). What is the octave of a harmonically rich note? Contemporary Music Review, 9, 6981. Penfield, W., & Perot, P. (1963). The brains record of auditory and visual experience. Brain, 86, 595696. Petkov, C. I., OConnor, K. N., & Sutter, M. L. (2003). Illusory sound perception in macaque monkeys. Journal of Neuroscience, 23, 91559161. Petkov, C. I., OConnor, K. N., & Sutter, M. L. (2007). Encoding of illusory continuity in primary auditory cortex. Neuron, 54, 153165. Petkov, C. I., & Sutter, M. L. (2011). Evolutionary conservation and neuronal mechanisms of auditory perceptual restoration. Hearing Research, 271, 5465. Pollack, I. (1978). Decoupling of auditory pitch and stimulus frequency: the Shepard demonstration revisited. Journal of the Acoustical Society of America, 63, 202206. Pressnitzer, D., Sayles, M., Micheyl, C., & Winter, I. M. (2008). Perceptual organization of sound begins in the auditory periphery. Current Biology, 18, 11241128. Pujol, J., Deus, J., Losilla, J. M., & Capdevila, A. (1999). Cerebral lateralization of language in normal left-handed people studied by functional MRI. Neurology, 52, 10381043.


Rahne, T., Bockmann, M., von Specht, H., & Sussman, E. S. (2007). Visual cues can modulate integration and segregation of objects in auditory scene analysis. Brain Research, 1144, 127135. Rasch, R. A. (1978). The perception of simultaneous notes such as in polyphonic music. Acustica, 40, 2233. Rasch, R. A. (1988). Timing and synchronization in ensemble performance. In J. A. Sloboda (Ed.), Generative processes in music: The psychology of performance, improvisation, and composition (pp. 7190). Oxford, U.K.: Oxford University Press. Rauschecker, J. P. (1998). Parallel processing in the auditory cortex of primates. Audiology and Neurootology, 3, 86103. Recanzone, G. H., Guard, D. C., Phan, M. L., & Su, T. K. (2000). Correlation between the activity of single auditory cortical neurons and sound localization behavior in the macaque monkey. Journal of Neurophysiology, 83, 27232739. Recanzone, G. H., & Sutter, M. L. (2008). The biological basis of audition. Annual Review of Psychology, 59, 119142. Remijn, G. B., Nakajima, Y., & Tanaka, S. (2007). Perceptual completion of a sound with a short silent gap. Perception, 36, 898917. Riecke, L., Mendelsohn, D., Schreiner, C., & Formisano, E. (2009). The continuity illusion adapts to the auditory scene. Hearing Research, 247, 7177. Risset, J.-C. (1969). Pitch control and pitch paradoxes demonstrated with computer-synthesized sounds. Journal of the Acoustical Society of America, 46, 88. Risset, J.-C. (1971). Paradoxes de hauteur: Le concept de hauteur sonore nest pas le meme pour tout le monde. Proceedings of the Seventh International Congress on Acoustics, Budapest, S10, 613616. Risset, J.-C. (1978). Paradoxes de hauteur (with sound examples). IRCAM Rep. 10 , Paris. Risset, J.-C. (1986). Pitch and rhythm paradoxes: Comments on Auditory paradox based on fractal waveform. Journal of the Acoustical Society of America, 80, 961962. Roberts, B., Glasberg, B. R., & Moore, B. C. J. (2002). Primitive stream segregation of tone sequences without differences in fundamental frequency or passband. Journal of the Acoustical Society of America, 112, 20742085. Roberts, B., Glasberg, B. R., & Moore, B. C. J. (2008). Effects of the build-up and resetting of auditory stream segregation on temporal discrimination. Journal of Experimental Psychology: Human Perception and Performance, 34, 9921006. Rogers, W. L., & Bregman, A. S. (1993). An experimental evaluation of three theories of auditory stream regulation. Perception & Psychophysics, 53, 179189. Rogers, W. L., & Bregman, A. S. (1998). Cumulation of the tendency to segregate auditory streams: Resetting by changes in location and loudness. Perception & Psychophysics, 60, 12161227. a ta nen, R. (1996). Neural mechanisms of the octave illusion: Ross, J., Tervaniemi, M., & Na electrophysiological evidence for central origin. Neuroreport, 8, 303306. Saffran, J. R., Johnson, E. K., Aslin, R. N., & Newport, E. L. (1999). Statistical learning of tone sequences by human infants and adults. Cognition, 70, 2752. Salzer, F. (1962). Structural hearing. New York, NY: Dover. Sandell, G. J., & Darwin, C. J. (1996). Recognition of concurrently-sounding instruments with different fundamental frequencies. Journal of the Acoustical Society of America, 100, 2683. Sasaki, T. (1980). Sound restoration and temporal localization of noise in speech and music sounds. Tohuku Psychologica Folia, 39, 7988.


Saupe, K., Koelsch, S., & Rübsamen, R. (2010). Spatial selective attention in a complex auditory environment such as polyphonic music. Journal of the Acoustical Society of America, 127, 472–480.
Scheffers, M. T. M. (1983). Sifting vowels: Auditory pitch analysis and sound segregation (Unpublished doctoral thesis). Groningen University, The Netherlands.
Schenker, H. (1956). Neue musikalische Theorien und Phantasien: Der freie Satz. Vienna, Austria: Universal Edition.
Schenker, H. (1973). Harmony (O. Jonas, Ed. and annotator; E. M. Borgese, Trans.). Cambridge, MA: MIT Press.
Schroeder, M. R. (1986). Auditory paradox based on fractal waveform. Journal of the Acoustical Society of America, 79, 186–189.
Seeba, F., & Klump, G. M. (2009). Stimulus familiarity affects perceptual restoration in the European starling (Sturnus vulgaris). PLoS One, 4, e5974.
Shamma, S. A., & Micheyl, C. (2010). Behind the scenes of auditory perception. Current Opinion in Neurobiology, 20, 361–366.
Shepard, R. N. (1964). Circularity in judgments of relative pitch. Journal of the Acoustical Society of America, 36, 2345–2353.
Singh, P. (1987). Perceptual organization of complex tone sequences: A tradeoff between pitch and timbre? Journal of the Acoustical Society of America, 82, 886–899.
Sloboda, J. A. (1985). The musical mind. New York, NY: Clarendon (Oxford University Press).
Smith, J., Hausfield, S., Power, R. P., & Gorta, A. (1982). Ambiguous musical figures and auditory streaming. Perception & Psychophysics, 32, 454–464.
Snyder, J. S., Alain, C., & Picton, T. W. (2006). Effects of attention on neuroelectric correlates of auditory stream segregation. Journal of Cognitive Neuroscience, 18, 1–13.
Snyder, J. S., Carter, O. L., Lee, S.-K., Hannon, E. E., & Alain, C. (2008). Effects of context on auditory stream segregation. Journal of Experimental Psychology: Human Perception and Performance, 34, 1007–1016.
Sonnadara, R. R., & Trainor, L. J. (2005). Perceived intensity effects in the octave illusion. Perception & Psychophysics, 67, 648–658.
Subirana, A. (1958). The prognosis in aphasia in relation to cerebral dominance and handedness. Brain, 81, 415–425.
Suga, N., & Ma, X. (2003). Multiparametric corticofugal modulation and plasticity in the auditory system. Nature Reviews Neuroscience, 4, 783–794.
Sugita, Y. (1997). Neuronal correlates of auditory induction in the cat cortex. Neuroreport, 8, 1155–1159.
Sussman, E., Gomes, H., Manette, J., Nousak, K., Ritter, W., & Vaughan, H. G. (1998). Feature conjunctions in auditory sensory memory. Brain Research, 793, 95–102.
Sussman, E., Ritter, W., & Vaughan, H. G., Jr. (1999). An investigation of the auditory streaming effect using event-related brain potentials. Psychophysiology, 36, 22–34.
Sussman, E., Horváth, J., Winkler, I., & Orr, M. (2007). The role of attention in the formation of auditory streams. Perception & Psychophysics, 69, 136–152.
Takegata, R., Huotilainen, M., Rinne, T., Näätänen, R., & Winkler, I. (2001). Changes in acoustic features and their conjunctions are processed by separate neuronal populations. Neuroreport, 12, 525–529.
Tan, N., Aiello, R., & Bever, T. G. (1981). Harmonic structure as a determinant of melodic organization. Memory and Cognition, 9, 533–539.
Temperley, D. (2001). The cognition of basic musical structures. Cambridge, MA: MIT Press.
Tenney, J., & Polansky, L. (1980). Temporal Gestalt perception in music. Journal of Music Theory, 24, 205–241.


Teranishi, R. (1982). Endlessly ascending/descending chords performable on a piano. Reports of the Acoustical Society of Japan, H6268.
Thomson, W. (1999). Tonality in music: A general theory. San Marino, CA: Everett Books.
Thompson, W. F., Hall, M. D., & Pressing, J. (2001). Illusory conjunctions of pitch and duration in unfamiliar tone sequences. Journal of Experimental Psychology: Human Perception and Performance, 27, 128–140.
Tian, B., Reser, D., Durham, A., Kustov, A., & Rauschecker, J. P. (2001). Functional specialization in rhesus monkey auditory cortex. Science, 292, 290–293.
Tougas, Y., & Bregman, A. S. (1985). Crossing of auditory streams. Journal of Experimental Psychology: Human Perception and Performance, 11, 788–798.
Tougas, Y., & Bregman, A. S. (1990). Auditory streaming and the continuity illusion. Perception & Psychophysics, 47, 121–126.
Treisman, A., & Gelade, G. (1980). A feature integration theory of attention. Cognitive Psychology, 12, 97–136.
Van Noorden, L. P. A. S. (1975). Temporal coherence in the perception of tone sequences (Unpublished doctoral dissertation). Technische Hogeschool Eindhoven, The Netherlands.
Vicario, G. (1960). L'effetto tunnel acustico. Rivista di Psicologia, 54, 41–52.
Vicario, G. (1973). Tempo psicologico ed eventi. Florence, Italy: C.-E. Giunti-G. Barbèra.
Vicario, G. (1982). Some observations in the auditory field. In J. Beck (Ed.), Organization and representation in perception (pp. 269–283). Hillsdale, NJ: Erlbaum.
Vliegen, J., Moore, B. C. J., & Oxenham, A. J. (1999). The role of spectral and periodicity cues in auditory stream segregation, measured using a temporal discrimination task. Journal of the Acoustical Society of America, 106, 938–945.
Vliegen, J., & Oxenham, A. J. (1999). Sequential stream segregation in the absence of spectral cues. Journal of the Acoustical Society of America, 105, 339–346.
Warren, J. D., Uppenkamp, S., Patterson, R. D., & Griffiths, T. D. (2003). Separating pitch chroma and pitch height in the human brain. Proceedings of the National Academy of Sciences USA, 100, 10038–10042.
Warren, R. M. (1983). Auditory illusions and their relation to mechanisms normally enhancing accuracy of perception. Journal of the Audio Engineering Society, 31, 623–629.
Warren, R. M., Obusek, C. J., & Ackroff, J. M. (1972). Auditory induction: Perceptual synthesis of absent sounds. Science, 176, 1149–1151.
Warren, R. M., Obusek, C. J., Farmer, R. M., & Warren, R. P. (1969). Auditory sequence: Confusions of patterns other than speech or music. Science, 164, 586–587.
Wertheimer, M. (1923). Untersuchung zur Lehre von der Gestalt II. Psychologische Forschung, 4, 301–350.
Wessel, D. L. (1979). Timbre space as a musical control structure. Computer Music Journal, 3, 45–52.
Wilson, E. C., Melcher, J. R., Micheyl, C., Gutschalk, A., & Oxenham, A. J. (2007). Cortical fMRI activation to sequences of tones alternating in frequency: Relationship to perceived rate and streaming. Journal of Neurophysiology, 97, 2230–2238.
Winer, J. A. (2006). Decoding the auditory corticofugal systems. Hearing Research, 212, 1–8.
Winkler, I., Czigler, I., Sussman, E., Horváth, J., & Balázs, L. (2005). Preattentive binding of auditory and visual stimulus features. Journal of Cognitive Neuroscience, 17, 320–339.
Winkler, I., Kushnerenko, E., Horváth, J., Čeponienė, R., Fellman, V., Huotilainen, M., et al. (2003). Newborn infants can organize the auditory world. Proceedings of the National Academy of Sciences, 100, 11812–11815.
Zwicker, T. (1984). Experimente zur dichotischen Oktav-Täuschung [Experiments on the dichotic octave illusion]. Acustica, 55, 128–136.

7 The Processing of Pitch Combinations
Diana Deutsch
Department of Psychology, University of California, San Diego, La Jolla, California

I. Introduction

In this chapter, we examine ways in which pitch combinations are processed by the perceptual system. We first inquire into the types of abstraction that give rise to the perception of local features, such as intervals, chords, and pitch classes. We also explore low-level abstractions that result in the perception of global features, such as contour. We next consider how combinations of features are further abstracted so as to give rise to perceptual equivalences and similarities. We discuss the roles played by basic, and probably universal, organizational principles in the perception of musical patterns, and the contributions made by stored knowledge concerning the statistical properties of music. We argue for the view that music is represented in the mind of the listener as coherent patterns that are linked together so as to form hierarchical structures. Other sections of the chapter are concerned with memory. We show how different aspects of musical tones are retained in parallel in separate memory systems, and that the output from these different systems is combined to determine memory judgments. We also consider the involvement of short-term memory for individual tones in our perception of tonal patterns. The final sections of the chapter concern a group of illusions that are produced by certain combinations of tones. These illusions have implications for individual differences in the perception of music, and for relationships between music and speech.

II. Feature Abstraction

A. Octave Equivalence
A strong perceptual similarity exists between tones that are related by octaves; that is, whose fundamental frequencies stand in a ratio of 2:1. Octave equivalence is implied in the music of many different cultures (cf. Nettl, 1956). In the Western
musical scale, tones that stand in octave relation are given the same name, so that a tone is specified first by its position within the octave and then by the octave in which it occurs (D2, F♯3, and so on). In one version of Indian musical notation, a tone is represented by a letter to designate its position within the octave, together with a dot or dots to designate its octave placement. Various observations related to octave equivalence have been reported. For example, listeners with absolute pitch may sometimes place a note in the wrong octave, even though they name it correctly (Bachem, 1955; Lockhead & Byrd, 1981; Miyazaki, 1989). Generalization of response to tones standing in octave relation has been found in human adults (Humphreys, 1939) and infants (Demany & Armand, 1984), as well as in animals (Blackwell & Schlosberg, 1943). Further, interference and consolidation effects in memory for pitch exhibit octave generalization (Deutsch, 1973b; Deutsch & Lapidis, in preparation). Given that tones standing in octave relation are in a sense perceptually equivalent, it has been suggested that pitch should be treated as a bidimensional attribute; the first dimension representing overall pitch level (pitch height) and the second dimension defining the position of the tone within the octave (tone chroma or pitch class) (Bachem, 1955; Deutsch, 1969, 1973b; Deutsch, Dooley, & Henthorn, 2008; Deutsch, Kuyper, & Fisher, 1987; Patterson, 1986; Pickler, 1966; Risset, 1969; Ruckmick, 1929; Shepard, 1964, 1982; Ueda & Ohgushi, 1987; Warren, Uppenkamp, Patterson, & Griffiths, 2003). This is discussed in detail later.
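The bidimensional description can be made concrete with a small computation. The Python sketch below is a minimal illustration, not part of the original text: it assumes 12-tone equal temperament and standard concert pitch (A4 = 440 Hz), and maps a frequency onto a continuous pitch height together with a pitch class (chroma), so that tones an octave apart share a chroma label but differ in height.

import math

A4_HZ = 440.0        # assumed reference tuning
A4_NUMBER = 69       # MIDI-style number conventionally assigned to A4
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def pitch_height(freq_hz):
    # Continuous pitch height in semitones, on a MIDI-style scale.
    return A4_NUMBER + 12 * math.log2(freq_hz / A4_HZ)

def pitch_class(freq_hz):
    # Position within the octave (chroma), ignoring octave placement.
    return NOTE_NAMES[round(pitch_height(freq_hz)) % 12]

for f in (220.0, 440.0, 880.0):
    print(pitch_class(f), round(pitch_height(f), 1))   # same chroma "A"; heights 57, 69, 81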

B. Perceptual Equivalence of Intervals and Chords


When two tones are presented either simultaneously or in succession, there results the perception of a musical interval, and intervals are perceived as the same in size when the fundamental frequencies of their component tones stand in the same ratio. This principle forms a basis of the traditional musical scale. The smallest unit of this scale is the semitone, which corresponds to a frequency ratio of approximately 1:1.06. Tone pairs that are separated by the same number of semitones are given the same name, such as major third, minor sixth, and so on. Chords consisting of three or more tones are also classified in part by the ratios formed by their components. However, a simple listing of these ratios is not sufficient to define a chord. For instance, major and minor triads are perceptually quite distinct, yet they are both composed of a major third (four semitones), a minor third (three semitones), and a perfect fifth (seven semitones). So it is of perceptual importance that the minor third lies above the major third in the major triad, and below it in the minor triad; this needs to be taken into account in considering how chords might be abstracted by the nervous system. Given the principles of octave and interval equivalence, one might hypothesize that the perceptual equivalence of intervals would persist if their component tones were placed in different octaves. This assumption has frequently been made by contemporary music theorists, who describe such intervals as in the same interval class. Traditional music theory assumes that such equivalence holds for simultaneous intervals. Those whose components have reversed their positions along the height

dimension are treated as harmonically equivalent (Piston, 1948/1987), and we easily recognize root progressions of chords in their different instantiations. Plomp, Wagenaar, and Mimpen (1973) and Deutsch and Roll (1974) have provided evidence for the perceptual similarity of harmonic intervals that are related by inversion. For successive intervals, however, it appears that interval class is not perceived directly, but rather through a process of hypothesis confirmation, in which the features that are directly apprehended are pitch class and interval (Deutsch, 1972c). Deutsch (1969) proposed a neural network that would accomplish the abstraction of low-level pitch relationships so as to produce basic equivalences found in music perception. The model is based on findings concerning the abstraction of low-level features in vision, such as orientation and angle size (Hubel & Wiesel, 1962). The hypothesized neural network consists of two parallel channels, along each of which information is abstracted in two stages. An outline of this model is shown in Figure 1. The first channel mediates the perceptual equivalence of intervals and chords under transposition. In the first stage of abstraction along this channel, firstorder units that respond to tones of specific pitch project in groups of two or three onto second-order units, which in consequence respond to specific intervals and chords, such as (C4, E4, G4) or (D5, G5). It is assumed that such linkages occur only between units underlying pitches that are separated by an octave or less. In the second stage of abstraction along this channel, second-order units project onto third-order units in such a way that second-order units activated by tones standing in the same relationship project onto the same unit. So, for example, all units activated by an ascending interval of four semitones (a major third) project onto one unit, all those activated by a descending interval of seven semitones (a perfect fifth)

[Figure 1 schematic: a primary array of tone-specific units feeds two parallel channels, a transposition channel (specific intervals and chords, then transposable intervals and chords) and an octave-equivalence channel (pitch classes, then invertible chords).]

Figure 1 Model for the abstraction of pitch relationships. Pitch information is abstracted along two parallel channels; one mediating transposition and the other mediating octave equivalence. Adapted from Deutsch (1969). 1969 by the American Psychological Association. Adapted with permission.

project onto a different unit, all those activated by a major triad project onto yet a different unit, and so on (Figure 2). The second channel mediates the perceptual equivalence of tones that stand in octave relation. In the first stage of abstraction along this channel, first-order units that respond to tones of specific pitch project onto second-order units in such a way that those standing in octave relation project onto the same unit. These second-order units then respond to tones in a given pitch class, regardless of the octave in which they occur, so can be termed pitch class units. In the second stage of abstraction along this channel, second-order units project in groups of two or three onto third-order units, which in consequence respond to combinations of pitch classes. Such units therefore mediate the perceptual similarity of intervals and chords that are related by inversion (Figure 3). This level of convergence is assumed to occur only for units that are activated by simultaneously presented tones. The general type of architecture proposed by this model has been found in mammalian auditory systems. Neurons have been found that act as AND gates, as hypothesized for the transposition channel, and others as OR gates, as hypothesized for the pitch class channel. In addition, the physiological evidence has shown that
[Figure 2 schematic: a primary array of tones projects onto units for specific intervals and chords, which in turn project onto units for abstracted intervals and chords.]

Figure 2 Two stages of abstraction along the transposition channel. Adapted from Deutsch (1969). 1969 by the American Psychological Association. Adapted with permission.

[Figure 3 schematic: a primary array of tones projects onto abstracted-octave (pitch class) units, which in turn project onto units for invertible chords.]

Figure 3 Two stages of abstraction along the octave-equivalence channel. Adapted from Deutsch (1969). 1969 by the American Psychological Association. Adapted with permission.

many auditory analyses are carried out in parallel subsystems, each of which is organized in hierarchical fashion (Knudsen, du Lac, & Esterly, 1987; Patterson, Uppenkamp, Johnsrude, & Griffiths, 2002; Schreiner, 1992; Suga, 1990; Sutter & Schreiner, 1991; Wessinger, VanMeter, Tian, Van Lare, Pekar, & Rauschecker, 2001). With respect specifically to interval identification, Suga, O'Neil, and Manabe (1979) have described neurons in the auditory cortex of the bat that showed facilitation when the second harmonic of a tone was delivered simultaneously with the third harmonic, so that the combination formed a perfect fifth. Other units showed facilitation when the second and fourth harmonics were simultaneously presented, so that the combination formed an octave; yet others showed facilitation when the third and fourth harmonics were simultaneously presented, so that the combination formed a perfect fourth. Such units often responded poorly to single tones in isolation, but strongly and consistently when the appropriate tonal combination was presented. On the present model, units with such characteristics are hypothesized to occur at the first stage of abstraction along the transposition channel. With respect to the pitch class channel, Evans (1974) found neurons in the auditory cortex of the cat that exhibited peaks of sensitivity at more than one band of frequencies, and peaks spaced at octave intervals were commonly found. Also Suga and Jen (1976) noted the presence of neurons in the bat auditory cortex that showed two peaks of sensitivity that were approximately harmonically related.
Ross, Choi, and Purves (2007) hypothesized that the intervals of Western tonal music have a special status, resulting from our constant exposure to speech sounds. The authors analyzed a database of spoken English vowels and found that, expressed as ratios, the frequency relationships between the first two formants in vowel phones represent all 12 intervals in the chromatic scale. It is intriguing to hypothesize, therefore, that through extensive exposure to speech sounds, higher-order connections are formed between lower-order units in such a way as to emphasize those units that feature the 12 chromatic intervals. Bharucha (1987, 1999) has hypothesized a more elaborate neural network, whose basic architecture has features similar to those proposed by Deutsch (1969). The model assumes that such feature detectors develop as a result of passive exposure to the music of our tradition, and it is discussed further in Chapter 8.
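The logic of the two stages of abstraction can be summarized in a toy sketch. The Python fragment below is only an illustration, not an implementation from Deutsch (1969): it assumes tones coded as MIDI-style integers, represents the transposition channel by collapsing a tone combination onto its interval pattern above the lowest tone, and represents the octave-equivalence channel by collapsing the combination onto a pitch-class set.

def transposition_channel(pitches):
    # Stage 2 of the transposition channel: all transpositions of the same
    # interval or chord map onto the same higher-order representation.
    root = min(pitches)
    return tuple(sorted(p - root for p in pitches))

def octave_equivalence_channel(pitches):
    # Stage 1 collapses tones onto pitch classes; stage 2 responds to the
    # resulting pitch-class combination, so inversions map onto the same unit.
    return frozenset(p % 12 for p in pitches)

c_major_root      = [60, 64, 67]   # C4 E4 G4
d_major_root      = [62, 66, 69]   # D4 F#4 A4: a transposition of C major
c_major_first_inv = [64, 67, 72]   # E4 G4 C5: an inversion of C major

print(transposition_channel(c_major_root) == transposition_channel(d_major_root))       # True
print(transposition_channel(c_major_root) == transposition_channel(c_major_first_inv))  # False
print(octave_equivalence_channel(c_major_root) == octave_equivalence_channel(c_major_first_inv))  # True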

C. Interval Class
When different two-tone combinations form the same interval by appropriate octave displacement, these combinations are held to be in the same interval class. For example, C3 paired with D5 forms the same interval class as G2 paired with F6. As noted earlier, the conditions under which interval class forms a basis for perceptual equivalence are complex ones. Experimental evidence for such equivalence has been obtained for simultaneous intervals, as mentioned earlier (Deutsch & Roll, 1974; Plomp et al., 1973). For successive intervals, however, the issue is complicated. If interval class were indeed a perceptual invariant, we should have no difficulty in recognizing a melody when its component tones are placed

haphazardly in different octaves. As a test of this prediction, Deutsch (1972c) generated the first half of the tune Yankee Doodle in different versions. First, it was produced without transformation in each of three adjacent octaves. Second, it was generated in such a way that each tone was in its correct position within the octave (i.e., the interval classes were preserved) but the octave placement of the tones varied haphazardly across the same three octaves. Third, the tune was generated as a series of clicks, so that the pitch information was removed entirely but the rhythm remained. The different versions of the tune were played to separate groups of subjects, who were given no clues as to its identity other than being assured that it was well known. Although the untransformed melody was universally recognized, the scrambled-octaves version was recognized no better than the version in which the pitch information was removed entirely. However, when the subjects were later given the name of the tune, and so knew what to listen for, they were able to follow the scrambled-octaves version to a large extent. This shows that the subjects were able to use pitch class to confirm the identity of the tune, although they had been unable to recognize it in the absence of cues on which to base a hypothesis. (This brief experiment is presented on the CD by Deutsch, 1995). This experiment shows that perception of interval class, where successions of tones are concerned, requires the involvement of an active, top-down process, in which the listener matches each tone as it arrives with his or her image of the expected tone. On this line of reasoning, the extent to which listeners perceive interval class depends critically on their knowledge and expectations. Other experimental findings have further indicated that interval class is not directly apprehended where successions of tones are concerned. Deutsch (1979) presented listeners with a standard six-tone melody, followed by a comparison melody. The comparison melody was always transposed four semitones up from the standard. On half the trials, the transposition was exact, and on the other half, two of the tones in the transposed melody were permuted, while the melodic contour was unchanged. There were four conditions in the experiment. In the first, the standard melody was played once, followed by the comparison melody. In the second, the standard melody was repeated six times before presentation of the comparison melody. In the third condition, the standard melody was again repeated six times, but now on half of the repetitions it was transposed intact an octave higher, and on the other half it was transposed intact an octave lower, so that the intervals forming the melody were preserved. In the fourth condition, the standard melody was again repeated six times, but now on each repetition the individual tones were placed alternately in the higher and lower octaves, so that the interval classes were preserved, but the intervals themselves were altered. Exact repetition of the standard melody resulted in a substantial improvement in recognition performance, and an improvement also occurred when the standard melody was repeated intact in the higher and lower octaves. However, when the standard melody was repeated in such a way that its tones alternated between the higher and lower octaves, performance was significantly poorer than when it was not repeated at all. This experiment provides further evidence that interval class

cannot be considered a first-order perceptual feature. Repeating a set of intervals resulted in memory consolidation for these intervals; however, repeating a set of interval classes did not do so. Deutsch and Boulanger (1984) further addressed this issue by presenting musically trained subjects with novel melodic patterns, which they recalled in musical notation. As shown in the examples in Figure 4, each pattern consisted of a haphazard ordering of the first six notes of the C-major scale. In the first condition, all the tones were taken from a higher octave; in the second, they were all taken from a lower octave. In the third condition, the individual tones alternated between these two octaves, so that roughly two thirds of the intervals formed by successive tones spanned more than an octave. The percentages of tones that were correctly notated in the correct serial positions in these different conditions are also shown in Figure 4, and it can be seen that performance in the third condition was substantially poorer than in the other two. The findings from these three experiments are in accordance with the two-channel model of Deutsch (1969), which assumes that neural linkages underlying the abstraction of successive intervals occur only between units responding to pitches that are separated by no more than an octave. It is interesting in this regard to consider the use of octave jumps in traditional music. On the present line of reasoning, such jumps can be made with impunity, provided the musical setting is such that the octave-displaced tone is anticipated by the listener. We should therefore expect that octave jumps would tend to be limited to such situations. Indeed, this appears to be the case. For example, a melodic line may be presented several times without transformation. A clear set of expectations having been established, a jump to a different octave occurs. The passage in Figure 5a, for instance, occurs after the melody has been presented several times without octave jumps. Interval class can also be successfully invoked when the harmonic structure is clear and unambiguous, so that again the displaced tones are highly probable. This is illustrated in the segment in Figure 5b.

Condition        Correct notations
Higher octave    62.7%
Lower octave     67.5%
Both octaves     31.8%

Figure 4 Examples of sequences used in different conditions of the experiment on the effect of octave jumps on recall of melodic patterns. At the right are shown the percentages of tones that were correctly recalled in the correct serial positions in the different conditions. Adapted from Deutsch and Boulanger (1984). 1984 by the Regents of the University of California.

Figure 5 Two examples of octave jumps in traditional Western music. Here the jumps are readily processed. (a) From Beethoven, Rondo in C, Op. 51, No. 1; (b) from Beethoven, Sonata in C minor, Op. 10, No. 1.

The technique of 12-tone composition uses very frequent octave jumps, and this raises the question of whether the listener does indeed identify as equivalent different instantiations of the same tone row under octave displacement. Given the evidence and arguments outlined earlier, such identification should be possible in principle, but only if the listener is very familiar with the material, or if its structure is such as to give rise strongly to the appropriate expectations (see also Meyer, 1973; Thomson, 1991).
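The scrambled-octaves manipulation used by Deutsch (1972c) is easy to state procedurally. The sketch below is a hypothetical reconstruction of that kind of transformation, not the original stimulus-generation code; the melody and the range of octave displacements are placeholder assumptions.

import random

def scramble_octaves(midi_pitches, octave_range=(-1, 1), seed=0):
    # Preserve each tone's pitch class but place it in a randomly chosen octave,
    # as in the "scrambled-octaves" condition: interval classes are preserved,
    # while the intervals themselves are altered.
    rng = random.Random(seed)
    return [p + 12 * rng.randint(*octave_range) for p in midi_pitches]

melody = [60, 62, 64, 60]   # placeholder melody (C4 D4 E4 C4), not the actual stimuli
scrambled = scramble_octaves(melody)
print(scrambled)
print([p % 12 for p in scrambled] == [p % 12 for p in melody])   # True: pitch classes unchanged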

D. Contour
We use global as well as specific cues in recognizing music. Such cues include, for example, overall pitch range, the distribution of interval sizes, and the relative proportions of ascending and descending intervals. Melodic contour plays a particularly important role here. As shown in Figure 6, melodies can be represented by their distinctive contours, even when their interval sizes are altered. One line of experimentation involving contour was initiated by Werner (1925). He reported that melodies could be recognized when they were transformed onto scales in which the octave was replaced by a different ratio, such as a fifth or two octaves, with these micro- or macro-octaves being divided into 12 equal intervals, so producing micro- or macro-scales. Later, Vicario (1983) carried out a study to determine how well listeners were able to recognize well-known melodies that had been transformed in this fashion. The results of this study are shown in Figure 7. As can be seen, although listeners were able to recognize such distorted melodies to some extent, the distortions nevertheless impaired melody recognition, with the amount of impairment being a function of the degree of expansion or compression of the octave. In another experiment, White (1960) found that listeners could recognize melodies to some extent when all the intervals were set to one semitone, so that only the sequence of directions of pitch change remained. Performance was enhanced when

Figure 6 Contours from Beethoven piano sonatas as represented by Schoenberg: (a) from Sonata in C minor, Op. 10/1-III; (b) from Sonata in D, Op. 10/3-III, mm. 1–16. From Schoenberg (1967).

[Figure 7 plot: percent correct recognition (0–100%) for untrained and trained listeners, as a function of the degree to which the octave was compressed (down to 1 semitone) or enlarged (up to 2 octaves).]
Figure 7 Percent correct recognition of melodies that have been transformed by compressing or enlarging the octave to differing extents. Adapted from Vicario (1983).

the relative sizes of the intervals were retained, but their absolute sizes were altered. Further studies have confirmed that contour can serve as a salient cue to melody recognition (see, e.g., Croonen, 1994; Dowling, 1978; Dowling & Fujitani, 1971; Edworthy, 1985; Idson & Massaro, 1978; and Kallman & Massaro, 1979). Further research has examined the cues that we use in judging similarity of contour. In much early work, contour was defined simply as the pattern of rises and falls in pitch, considering only temporally adjacent notes (cf. Dowling, 1978; Idson & Massaro, 1978). However, recent theoretical work has been concerned both with relationships between temporally adjacent notes and also with larger-scale features of contour (Marvin & LaPrade, 1987; Polansky & Bassein, 1992; Quinn, 1997). In an investigation of the relative salience of these two aspects of contour, Quinn (1999) constructed pairs of melodies that were either equivalent in note-to-note contour but not in the relationships between each note and the other notes in the melody, equivalent according to both criteria, or not equivalent according to either criterion. The subjects rated the degree of similarity between the members of each

pair of melodies. The ratings indicated that note-to-note equivalence of contour played a primary role in similarity judgment, but that relationships between nonadjacent notes also had an influence. Schmuckler (1999, 2004, 2009) adopted an alternative approach to contour perception. He characterized contour in terms of the relative degrees of strength of its cyclic information, as quantified by Fourier analysis. Schmuckler (2010) produced some interesting experimental support for this approach, though more findings are needed to evaluate it in detail.
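A note-to-note contour of the kind assumed in much of the early work can be computed directly from the sequence of pitch changes. The sketch below is illustrative only: the melodies are invented, and the comparison is simple equality of contour vectors rather than any of the similarity measures used in the studies cited above.

def contour(midi_pitches):
    # Note-to-note contour: +1 for a rise, -1 for a fall, 0 for a repeated pitch.
    return [(b > a) - (b < a) for a, b in zip(midi_pitches, midi_pitches[1:])]

def same_contour(melody_a, melody_b):
    return contour(melody_a) == contour(melody_b)

original  = [60, 64, 62, 67, 65]   # rises and falls
stretched = [60, 69, 62, 72, 63]   # different interval sizes, same up/down pattern
print(contour(original))                  # [1, -1, 1, -1]
print(same_contour(original, stretched))  # True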

E. Pitch Organization in Melody


We now turn to the related question of how listeners organize pitches so as to perceive coherent melodic phrases. As described in Chapter 6, pitch proximity is a powerful organizing principle in melody: We tend to group together tones that are close in pitch, and to separate out those that are further apart. When tones are presented at a rapid tempo, and these are drawn from two different pitch ranges, the listener perceives two melodic streams in parallel, one corresponding to the lower tones and the other to the higher ones, a phenomenon termed stream segregation (Bregman, 1990). However, pitch proximity also operates to group together tones when stream segregation does not occur. Hamaoui and Deutsch (2010) presented subjects with sequences of tones at interonset intervals of roughly 300 ms. The basic pattern consisted of a sequence of 12 tones that ascended or descended in semitone steps. Pitch distances of 2, 5, and 11 semitones were inserted between every three or four tones, and the subjects reported whether they heard the sequence as grouped into units of three or four tones each. When the sequences were isochronous, grouping by pitch proximity always occurred with the insertion of 5- or 11-semitone distances between successive tones, and such grouping even occurred to a statistically significant extent with the insertion of 2-semitone distances.
Grouping by pitch proximity is associated with substantial processing advantages. In a study by Deutsch (1978a), listeners compared the pitches of two tones that were separated by a sequence of intervening tones. As shown later in Figure 23, the smaller the average interval size formed by the intervening tones, the lower the error rate in judging whether the test tones were the same or different in pitch (see also Deutsch, 1974). Using a different paradigm, Aarden (2003) had subjects listen to folksong melodies. When each tone was played, subjects responded whether it was higher, lower, or identical in pitch to the previous tone. It was found that the closer the successive tones were in pitch, the faster were the subjects' reaction times.
The cognitive advantage conferred by smaller melodic intervals may account, at least in part, for the finding that in many cultures the frequency of occurrence of a melodic interval decreases as a function of its size. This has been shown in melodies from Africa, America, Asia, and Europe (Dowling, 1967; Huron, 2001; Merriam, Whinery, & Fred, 1956; Ortmann, 1926). Further, in an analysis of melodic

intervals in more than 4,000 folk songs, the average interval size formed by tones within phrases was 2.0 semitones, whereas that between tones at the end of one phrase and the beginning of the next was 2.9 semitones (Huron, 2006). This last finding indicates that smaller intervals serve to produce coherent relationships between tones within a phrase, and that larger intervals serve to separate out tones that cross phrase boundaries.
Repetition is also an important factor. We can note that there is a cross-cultural tendency for musical phrases to contain one or more tones that are repeated more often than others. From an experimental perspective, Deutsch (1970a, 1972a, 1975a) had listeners compare the pitches of two tones that were separated by a sequence of intervening tones, and found that repetition of the first test tone resulted in considerable memory enhancement for that tone (see also Section IV). Given these findings, we should expect that phrases containing repeated tones would be better remembered, and that the more often a tone is repeated, the better this tone would be remembered, so the greater would be its influence on the organization of the entire phrase.
So when we consider these two low-order effects together (i.e., grouping by pitch proximity and memory enhancement through repetition), we can see that a considerable processing advantage is to be gained from a system in which there are a limited number of anchor tones, which are well remembered through repetition, surrounded by satellite tones that are linked to these anchor tones by pitch proximity. As argued by Deutsch (1982b), these two low-order effects acting together may well have influenced the development of musical systems across cultures. Erickson (1984) and Kim (2011) have also argued that such a principle, which Erickson termed melodic tonal centering, is a universal and possibly innate characteristic of tonal organization, which is not bound to any particular musical culture or historical period. A similar argument has been made by Thomson (1999, 2006), who proposed that melodies in different cultures share a type of organization that he termed tonality frames, in which certain pitches serve as anchors in defining the pitch ranges of tones within melodies.
Another cross-cultural tendency was documented by Vos and Troost (1989) in an analysis of samples of music from Western classical composers, and from European and African-American folk songs. These authors found that large melodic intervals were more likely to ascend and small intervals to descend. Huron (2006) later extended these findings to samples of music from Australia, Asia, and Africa. Meyer (1956) and Narmour (1990) have proposed that when presented with a melodic interval of small size, listeners expect to hear a further melodic interval that moves in the same direction. Evidence for this conjecture was obtained by Von Hippel (2002) in a study of anticipation judgments, and by Aarden (2003) in a reaction time study. For the case of large intervals, music theorists have observed that these generally tend to be followed by a change in direction, a tendency referred to as post-skip reversal. Watt (1924), in analyses of Schubert lieder and Ojibway songs, found that as the size of an interval increased,

the probability increased that the next interval would move in the opposite direction. Later, Von Hippel and Huron (2000) extended Watt's finding to traditional European, Chinese, South African, and Native American folk songs. Interestingly, Han, Sundararajan, Bowling, Lake, and Purves (2011) found that changes in pitch direction occurred more frequently, and intervals tended to be larger, in samples of music from tone language cultures than from nontone language cultures. And as expected, there were also more frequent changes in pitch direction and larger intervals in speech samples from tone language cultures. However, the general cross-cultural findings of post-skip reversal, and the prevalence of steps rather than skips in melodies, still held in this study.
The reason for the tendency for post-skip reversal has been a matter of debate. Meyer (1973) proposed that this occurs because listeners want to hear the gap produced by the large interval as filled with pitches lying within the gap. Von Hippel (2000) and Von Hippel and Huron (2000) later advanced an alternative explanation in terms of regression to the mean. Sampling a large number of melodies, they observed that pitches in most melodies formed a normal distribution, so that those in the center of a melody's range occurred most frequently, and the probability that a particular pitch would occur decreased with an increase in its distance from the center of the range. They argued, therefore, that most large intervals take a melody to an extreme of its range, creating the likelihood that the next pitch would be closer to the center. They obtained evidence for this view in a study of several hundred melodies from different cultures and periods. Interestingly, though, they also found, in line with Meyer's conjecture, that listeners expected large intervals to be followed by a change in direction, regardless of the location of the pitches relative to the center of the distribution.
The perceptual tendencies explored so far are related to Narmour's (1990, 1992) implication-realization model of musical expectations. Narmour proposed that listeners bring to their perception of melodies a number of expectations based on universal, and possibly innate, principles of music perception and cognition. One basic principle proposed by Narmour is that listeners expect small intervals to be followed by continuations in the same direction, and large intervals to be followed by a directional change. As another basic principle, Narmour proposed that listeners expect a small interval to be followed by one that is similar in size, and a large interval to be followed by one of smaller size. Narmour's principles have been the subject of substantial investigation (Cuddy & Lunny, 1995; Pearce & Wiggins, 2006; Schellenberg, 1996, 1997; Schmuckler, 1989; Thompson & Stainton, 1998), and considerable supporting evidence for them has been obtained. Variations of Narmour's model have also been proposed. For example, Schellenberg (1997) proposed a two-factor model of musical expectations involving pitch proximity and pitch reversal; further, to account for more global expectations, he extended the principle of proximity to noncontiguous tones. Another important principle is the involvement of tonal schemata (Gjerdingen, 1988, 2007; Meyer, 1973). Certain musical patterns are prominent in works

composed in particular musical styles, and these musical schemata and archetypes influence memory and perception of music in listeners who are familiar with the appropriate style (see also Kim, 2011).
In considering overall pitch relationships within phrases, two types of structure appear to occur quite commonly. Sachs (1962) has noted that in certain cultures and contexts, melodies are dominated by phrases that begin with a large ascending interval, and continue with a series of tones that descend in stepwise fashion. He termed these melodies tumbling strains, and noted that they tend to occur, for example, in East European laments. A tendency has also been noted for phrases to rise and then fall in pitch, producing an arch-shaped contour. Huron (1996), in an analysis of phrases taken from more than 6,000 European folk songs, found that more than 40% of the analyzed phrases followed this pattern.
It is interesting to relate tumbling strains and melodic arch patterns to paralinguistic utterances, and to pitch patterns in exaggerated speech. Tumbling strains in laments bear a resemblance to wails that are produced in extreme distress, and may well derive in part from these. Also, both these contours bear strong resemblances to the exaggerated pitch patterns employed by mothers in communicating with preverbal infants, a form of speech termed motherese. For example, mothers use falling pitches to soothe distressed infants, and they use steep arch-shaped contours to express approval or praise, as in saying "Go-o-od girl!" Interestingly, these particular speech patterns occur in many different languages and cultures. Babies tend to respond appropriately even though they do not yet understand speech, even to phrases that are spoken in a foreign language (Fernald, 1993). We may then surmise that arch-shaped and falling pitch contours in music are related to a primitive and perhaps universal desire to produce such patterns in appropriate situations, and to a primitive impulse to respond to them.
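The corpus statistics discussed in this section, such as the distribution of interval sizes and the rate of post-skip reversal, can be computed with a few lines of code. The sketch below is a minimal illustration under stated assumptions: melodies are lists of MIDI-style pitch numbers, the two melodies are placeholders rather than real folk-song data, and the 5-semitone threshold for a "skip" is an arbitrary choice made only for the example.

from collections import Counter

def melodic_intervals(midi_pitches):
    return [b - a for a, b in zip(midi_pitches, midi_pitches[1:])]

def interval_size_distribution(melodies):
    # How often each absolute interval size (in semitones) occurs in the corpus.
    sizes = Counter()
    for m in melodies:
        sizes.update(abs(i) for i in melodic_intervals(m))
    return sizes

def post_skip_reversal_rate(melodies, skip=5):
    # Proportion of skips (intervals of at least `skip` semitones) that are
    # followed by a change of direction: the post-skip reversal statistic.
    reversals = total = 0
    for m in melodies:
        ints = melodic_intervals(m)
        for first, second in zip(ints, ints[1:]):
            if abs(first) >= skip and second != 0:
                total += 1
                reversals += (first > 0) != (second > 0)
    return reversals / total if total else float("nan")

corpus = [[60, 62, 64, 72, 71, 69], [67, 65, 64, 57, 59, 60]]   # placeholder corpus
print(interval_size_distribution(corpus))
print(post_skip_reversal_rate(corpus))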

III. Abstraction of Higher-Order Shapes

We next inquire into how higher-order abstractions are derived so as to lead to perceptual equivalences and similarities. We recognize visual shapes when these differ in size, position in the visual field, and to some extent in orientation. What transformations result in analogous equivalences in music? Theorists have long drawn analogies between perception of pitch relationships and relationships in visual space (Helmholtz, 1859/1954; Koffka, 1935; Mach, 1906/1959). In contrast to visual space, however, pitch was conceived as represented along one dimension only. As Mach (1906/1959) wrote:
A tonal series is something which is an analogue of space, but is a space of one dimension limited in both directions and exhibiting no symmetry like that, for instance of a straight line running from right to left in a direction perpendicular to the median plane. It more resembles a vertical right line. . .

Several investigators have shown that auditory analogues of visual grouping phenomena may be created by mapping one dimension of visual space into log frequency and the other into time (Bregman, 1990; Deutsch, 1975b; Van Noorden, 1975). The principle of proximity emerges clearly, for example, in the visual representation of the sequence shown in Figure 4 of Chapter 6. We may therefore inquire whether other perceptual equivalences in vision have analogues in the perception of music.

A. Transposition
Von Ehrenfels (1890), in his influential paper on form perception, pointed out that when a melody is transposed it retains its essential form, the Gestaltqualität, provided the relations among the individual tones are preserved. In this respect, he argued, melodies are similar to visual shapes; these retain their perceptual identities when they are translated to different locations in the visual field. A number of factors influence the extent to which a transposed and slightly altered melody is judged as similar to the original one. For example, when the original and transposed melodies can be interpreted as in the same key, and the successive tones comprising the melodies form the same number of steps along the diatonic scale, the melodies are generally judged as very similar to each other. This holds true whether or not the intervals forming the melodies are the same (Bartlett & Dowling, 1980; Dewitt & Crowder, 1986; Dowling, 1978, 1986; Takeuchi & Hulse, 1992; Van Egmond & Povel, 1994a, b), and can be taken to reflect the projection of pitch information onto overlearned alphabets, as proposed in the model of Deutsch and Feroe (1981) to be described later, and illustrated later in Figures 10 and 11.
Several researchers have hypothesized that the extent to which a transposed melody is perceived as related to the original one is influenced by the key distance between them. Key distance is defined in terms of distance along the cycle of fifths. So, for example, a melody that has been transposed from C major to G major is held to be more related to the original melody than one that has been transposed from C major to F♯ major (see, e.g., Bartlett & Dowling, 1980; Cuddy, Cohen, & Mewhort, 1981; Dowling, 1991; Dowling & Bartlett, 1981; Takeuchi & Hulse, 1992; Trainor & Trehub, 1993; Van Egmond & Povel, 1994a, 1994b). Key distance has been found to affect melody recognition in complex ways (Dowling, 1991; Takeuchi & Hulse, 1992; Van Egmond & Povel, 1994b), and explanations for the obtained findings have been strongly debated (Dowling, 1991; Takeuchi & Hulse, 1992; Takeuchi, 1994; Van Egmond & Povel, 1994a). An important point here is that the closer two keys stand along the cycle of fifths, the larger the overlap of their pitch classes. For example, the C-major scale consists of pitch classes (C, D, E, F, G, A, B) and the G-major scale consists of pitch classes (G, A, B, C, D, E, F♯); these two scales therefore share six out of seven pitch classes. However, the F♯-major scale consists of (F♯, G♯, A♯, B, C♯, D♯, F); so the C-major and F♯-major scales share only two out of

seven pitch classes. As described in Section IV, repetition of a pitch or pitch class strongly enhances its representation in short-term memory (Deutsch, 1970a, 1972a, 1975a). So when two melodies are presented in a short-term setting, and these are related by transposition, the salience of the tones in near-key transpositions should be considerably enhanced relative to those in far-key transpositions. As a further short-term memory effect, when two tones are compared for pitch, and these are separated by a sequence of intervening tones, including in the intervening sequence a tone that is a semitone removed from the first test tone produces an increase in errors. Further, presenting two tones in the intervening sequence, one a semitone higher than the first test tone and the other a semitone lower, produces a substantial increase in errors (Deutsch, 1973a, 1973b, 1975c; Deutsch & Feroe, 1975; see also Section IV). Now when the C-major scale is presented followed by the G-major scale (a near-key transposition), only one of the seven tones of the G-major scale is preceded by tones that are both a semitone above and a semitone below it, namely the tone F♯. However, when the C-major scale is presented followed by the F♯-major scale (a far-key transposition), five of the seven tones of the F♯-major scale are preceded by tones that are both a semitone above and a semitone below them, namely the tones F♯, G♯, A♯, C♯, and D♯. So for far-key transpositions, tones are subject to a larger amount of short-term memory interference than are near-key transpositions. This difference in amount of interference should differentially affect comparison judgments of melodies that are related by near and far keys.
Key distance effects have also been invoked for triads; for example, the C-major triad is considered more related to the G-major triad than to the F♯-major triad. Experiments exploring these effects have generally employed the following paradigm: A prime context consisting of a chord or a sequence of chords is followed by a target chord, and subjects make a perceptual judgment on the target chord, such as an intonation or temporal asynchrony judgment. Targets have been found to be better processed when they were preceded by a harmonically related prime than when they were preceded by a less related prime (Bharucha & Stoeckig, 1986, 1987; Bigand, Tillmann, Poulin-Charronat, & Manderlier, 2005; Justus & Bharucha, 2002; Tillmann & Bharucha, 2002; Tillmann, Bigand, & Pineau, 1998; Tillmann & Lebrun-Guillaud, 2006). These findings are also equivocal in their interpretation. Although they have generally been attributed to acquired knowledge concerning chord progressions, short-term effects of repetition and interference could have played a role. Some complex effects of repetition have been found (see, e.g., Tekman & Bharucha, 1998); however, such effects have frequently not been controlled for, and there has been no control for specific effects of memory interference. For example, the C-major (C, E, G) and G-major (G, B, D) triads, which are considered closely related, have a tone in common, namely G; further, only one pair of tones across these triads stand in semitone relation, namely C and B. On the other hand, the C-major (C, E, G) and B-major (B, D♯, F♯) triads, which are considered unrelated, have no tones in common, and all three pairs of tones across these triads stand in semitone relation, namely

C and B, E and D♯, and G and F♯. So although it is reasonable to hypothesize that harmonic priming effects could be based on acquired knowledge of abstract relationships in tonal music, it is unclear to what extent these effects result from such acquired knowledge, and to what extent short-term memory effects are responsible. Other factors have also been found to influence the similarity of transposed melodies. For example, several researchers have observed that the closer two melodies are in pitch range, the greater their perceived similarity (Francès, 1958/1988; Hershman, 1994; Van Egmond & Povel, 1994b; Van Egmond, Povel, & Maris, 1996). In addition, the coding model of Deutsch and Feroe (1981) has been used successfully as a predictor of perceived similarity between transposed melodies (Van Egmond & Povel, 1996), as described in Section III,D.
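The pitch-class overlap underlying the key-distance argument can be checked directly. The following sketch is an illustration, not taken from the studies cited: it builds the pitch-class set of a major scale on any tonic and counts the pitch classes shared by two keys, reproducing the six-of-seven overlap for C and G major and the two-of-seven overlap for C and F♯ major noted above.

MAJOR_STEPS = [0, 2, 4, 5, 7, 9, 11]   # semitone positions of the major scale

def major_scale_pcs(tonic_pc):
    # Pitch-class set of the major scale built on the given tonic (0 = C).
    return {(tonic_pc + step) % 12 for step in MAJOR_STEPS}

def shared_pcs(tonic_a, tonic_b):
    return len(major_scale_pcs(tonic_a) & major_scale_pcs(tonic_b))

print(shared_pcs(0, 7))   # C major vs. G major (adjacent on the cycle of fifths): 6
print(shared_pcs(0, 6))   # C major vs. F-sharp major (opposite side of the cycle): 2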

B. Inversion and Retrogression


We may next inquire whether further equivalences can be demonstrated for musical shapes that are analogous to their visuospatial counterparts. Schoenberg (1951) argued that transformations similar to rotation and reflection in vision result in perceptual equivalences in music also. He wrote:
The unity of musical space demands an absolute and unitary perception. In this space ... there is no absolute down, no right or left, forward or backward ... Just as our mind always recognizes, for instance, a knife, a bottle or a watch, regardless of its position, and can reproduce it in the imagination in every possible position, even so a musical creator's mind can operate subconsciously with a row of tones, regardless of their direction, regardless of the way in which a mirror might show the mutual relations, which remain a given quantity.

This statement may be compared with Helmholtz's (1844) description of imagined visuospatial transformations:
Equipped with an awareness of the physical form of an object, we can clearly imagine all the perspective images which we may expect upon viewing it from this or that side. (see Warren & Warren, 1968, p. 252)

On this basis, Schoenberg proposed that a row of tones may be recognized as equivalent when it is transformed in such a way that all ascending intervals become descending ones, and vice versa (inversion), when it is presented in reverse order (retrogression), or when it is transformed by both these operations (retrograde-inversion). Figure 8 illustrates Schoenberg's use of his theory in compositional practice. As Schoenberg (1951) wrote:
The employment of these mirror forms corresponds to the principle of the absolute and unitary perception of musical space.

[Figure 8 diagram: the tone row shown in four forms, labeled Basic Set, Retrograde Set, Inversion, and Retrograde Inversion.]

Figure 8 Schoenberg's illustration of his theory of equivalence relations between pitch structures, taken from his Wind Quintet, Op. 26. From Schoenberg (1951).

Schoenberg did not conceive of the vertical dimension of musical space simply as pitch, but rather as pitch class. His assumptions of perceptual equivalence under transposition, retrogression, inversion, and octave displacement are fundamental to 12-tone composition (Babbitt, 1960). In this procedure, a given ordering of the 12 tones within the octave is adopted. The tone row is repeatedly presented throughout the piece; however, the above transformations are allowed on each presentation, and it is assumed that the row is perceived as an abstraction in its different manifestations. Whether such transformations indeed result in perceptual equivalence is debatable. In the visual case, we must have evolved mechanisms that preserve the perceptual identities of objects regardless of their orientation relative to the observer. An analogous ecological argument cannot be made for inversion and retrogression of sound patterns. A second doubt is based on general experience. Sound sequences

often become unrecognizable when they are reversed in time, as we can confirm by attempting to decode a segment of speech when it is played backward. Furthermore, many inverted three-note combinations are perceptually very dissimilar to the combinations from which they are derived. For example, a minor triad is an inversion of a major triad, yet the two are perceptually quite distinct from each other. It would appear, therefore, that when inverted and retrograde patterns are recognized, this is accomplished at a level of abstraction that is equivalent to the one that allows us to recite a segment of the alphabet backwards or to invert a series of numbers (Deutsch & Feroe, 1981). For further discussions of the perceptual status of 12-tone compositions, see Krumhansl, Sandell, and Sergeant (1987), Francès (1958/1988), and in particular Thomson (1991).
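For readers who want to see the transformations concretely, the sketch below applies inversion, retrogression, and retrograde-inversion to a tone row represented as pitch classes. The row itself is a placeholder, not the row of Schoenberg's Op. 26, and mirroring about the first tone of the row is one common convention rather than the only possible one.

def transpose(row, n):
    return [(p + n) % 12 for p in row]

def inversion(row):
    # Ascending intervals become descending ones and vice versa,
    # mirrored about the first tone of the row.
    first = row[0]
    return [(2 * first - p) % 12 for p in row]

def retrograde(row):
    return list(reversed(row))

def retrograde_inversion(row):
    return retrograde(inversion(row))

row = [0, 11, 7, 8, 3, 1, 2, 10, 6, 5, 4, 9]   # placeholder 12-tone row (pitch classes 0-11)
print(inversion(row))
print(retrograde(row))
print(retrograde_inversion(row))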

C. Models of Pitch Space


Over the centuries, theorists have proposed representations of pitch and pitch relationships in terms of distances in multidimensional space. For example, in order to capture the close perceptual similarity between tones that stand in octave relation, it has been suggested that pitch be represented as a helix, with the vertical axis corresponding to pitch height and tones separated by octaves lying closest within each turn of the helix (Section V; see also Chapter 6). More elaborate representations have also been proposed that would capture the complex patterns of pitch relationship that are invoked in listening to tonal music. For example, Longuet-Higgins (1962a, 1962b) has suggested that tonal space be characterized as a three-dimensional array: Tones that are adjacent along the first dimension are separated by fifths, those adjacent along the second dimension by major thirds, and those adjacent along the third dimension by octaves. The intervals of tonal music then appear as vectors in this tonal space. On this model, closely related tones, such as form a given major scale, produce a compact group in this array, so that a key can be defined as a neighborhood in tonal space. Similar representations have been proposed by others, such as Hall (1974), Balzano (1980), and Shepard (1982). The spatial modeling of pitch relationships in the context of keys has a long tradition among music theorists. In particular, 18th century theorists developed circular configurations that would capture degrees of modulation between keys. In these models, adjacent positions along such circles depict close modulations, and positions that are further removed depict more distant ones. Later theorists such as Weber (1824) and Schoenberg (1954/1969) have produced related spatial models (Werts, 1983). Leonard Meyer (1956) has argued that the mental representation of pitch relationships in classical tonal music is strongly influenced by hierarchies of relative stability and rest between tones in an established key. As he wrote:
The term tonality refers to the relationships existing between tones or tonal spheres within the context of a particular style system . . . some of the tones of the system are active. They tend to move toward the more stable points in the system, the structural or substantive tones. But activity and rest are relative terms because tonal systems are generally hierarchical: tones which are active tendency tones on one level may be focal substantive tones on another level and vice versa. Thus in the major mode in Western music the tonic tone is the tone of ultimate rest toward which all other tones tend to move. On the next higher level the third and fifth of the scale, though active melodic tones relative to the tonic, join the tonic as structural tones; and all other tones, whether diatonic or chromatic, tend toward one of these. Going still further in the system, the full complement of diatonic tones are structural focal points relative to the chromatic notes between them. And, finally, as we have seen, any of these twelve chromatic notes may be taken as substantive relative to slight expressive deviations from their normal pitches. (Meyer, 1956, pp. 214-215)

The concept of a hierarchy of prominence for tones within a key was explored by Krumhansl (1979) in a study in which subjects judged similarities between pairs of tones that were presented in a tonal context. Multidimensional scaling of similarity ratings produced a three-dimensional conical structure around which tones were ordered according to pitch height. The components of the major triad formed a closely related structure near the vertex of the cone; the other tones in the major diatonic scale formed a less closely related subset that was further from the vertex, and the remaining pitch classes were more widely dispersed and still further from the vertex. These layers were then hypothesized to represent different degrees of stability for the pitch classes within a key.

There is a problem, however, with a representation that assigns to each pitch class a fixed degree of stability within a key regardless of the short-term context in which it is embedded; a tone that is heard as highly stable in one context is heard as less stable in others. As a further problem, such a representation does not explain how the different pitch classes within a key are connected so as to form a unified whole. We need to know how tones at each hierarchical level are connected so as to form coherent patterns, and how such patterns are connected across hierarchical levels. Gjerdingen (1988, 2007), Narmour (1990, 1992), and Kim (2011) have all stressed that hierarchies in tonal music are formed of perceptually stable and closed tonal-temporal patterns, rather than nontemporal pitch hierarchies.

Deutsch and Feroe (1981) proposed a model for the mental representation of pitch sequences in tonal music in terms of tonal-temporal patterns that are linked together as hierarchies. The model also assumes that there is a hierarchy of pitch alphabets within an established key, though the role of any given pitch class depends on the short-term context in which it occurs. Pitch sequences composed of such alphabets at any one level form structural units at that level. Further, at each level, tones are elaborated by further tones at the next-lower level. Conversely, structural units at any one level contain tones that serve as reference points that unite to form structural units at the next-higher level. A representation of Deutsch and Feroe's hierarchy of embedded alphabets is shown in Figure 9. The model assumes that, through extensive exposure to Western tonal music, the listener acquires this repertoire of embedded alphabets,


Figure 9 A hierarchy of embedded pitch alphabets; from the highest level to the lowest, the levels are the tonic, the major triad, the major scale, and the chromatic scale (C, C♯, D, D♯, E, F, F♯, G, G♯, A, A♯, B). Adapted from Deutsch and Feroe (1981). ©1981 by the American Psychological Association. Adapted with permission.

most prominently the chromatic scale, diatonic scales, and triads. At the lowest level, the chromatic alphabet serves as the parent alphabet from which families of subalphabets are derived. The major and minor scales are represented at the next-higher level; these can be expressed in terms of proximal distances along the chromatic alphabet. Triads are represented at the next-higher level; these can be expressed in terms of proximal distances along diatonic alphabets. Lerdahl (2001) has proposed an elaboration of Deutsch and Feroe's hierarchy of alphabets that also takes account of a number of other characteristics of tonal music, such as patterns of proximity between chords (see Lerdahl, 2001, p. 47).

Compositional practice reflects our use of such overlearned alphabets. For example, in the short-term transposition of motives, the number of steps along an alphabet is often preserved, so that even when such transpositions result in alterations in interval size, they still appear appropriate to the listener. Figures 10 and 11 give two such examples. The first, from a Bach fugue, shows a motive that traverses the D-major scale four times in succession, each time beginning on a different position along the scale. The second, from a Schubert impromptu, shows a motive that traverses the A♭-minor triad five times in succession, each time beginning at different positions along the triad. In both cases, preservation of the pitch alphabet has the consequence that the intervals vary in the different instantiations of the motive (Deutsch, 1977, 1978d).

There is experimental evidence that pitch structures in Western tonal music are represented by listeners in terms of such embedded alphabets. Deutsch (1980) had subjects listen to sequences of tones that were drawn from such alphabets, and recall what they heard in musical notation. When errors in notation occurred, they rarely departed from the alphabet that had been presented. (So, for example, if a sequence consisted of tones in the G-major triad, erroneous notations would also be in the G-major triad.) In general, sequences were recalled very accurately when they could be simply represented as hierarchical structures, with different pitch alphabets at different levels of the hierarchy (see below).
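To make the idea of transposition along an overlearned alphabet concrete, the following minimal Python sketch represents an alphabet as an ordered set of pitches and shifts a toy motive by a fixed number of steps along it. The representation, variable names, and the motive are illustrative assumptions, not Deutsch's notation; the point is simply that the step pattern is preserved while the semitone intervals change, as in the Bach and Schubert examples of Figures 10 and 11 below.

```python
# Illustrative sketch: transposing a motive by steps along a pitch alphabet
# preserves the step pattern but can change the semitone intervals, because
# scale and triad alphabets have unequal step sizes.

D_MAJOR = [2, 4, 6, 7, 9, 11, 13]   # D E F# G A B C#, as semitones above C

def step_to_semitone(alphabet, step):
    """Map a step index along the (octave-repeating) alphabet to a semitone value."""
    octave, degree = divmod(step, len(alphabet))
    return alphabet[degree] + 12 * octave

def realize(alphabet, steps):
    """Realize a motive, given as step indices, as semitone values."""
    return [step_to_semitone(alphabet, s) for s in steps]

def intervals(pitches):
    """Successive intervals in semitones."""
    return [b - a for a, b in zip(pitches, pitches[1:])]

motive = [7, 6, 7]                     # a toy motive: down one scale step, then back up
transposed = [s - 1 for s in motive]   # the same motive one scale step lower

print(intervals(realize(D_MAJOR, motive)))      # [-1, 1]  (semitone steps)
print(intervals(realize(D_MAJOR, transposed)))  # [-2, 2]  (whole-tone steps)
```

The two printed interval patterns differ even though the motive traverses the same number of steps along the alphabet in both cases, which is the sense in which such transpositions "still appear appropriate to the listener."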


Figure 10 Transposition along the alphabet of the D-major scale, shown as log frequency against time. The same pattern is presented four times in succession at different positions along the scale. Because the major scale consists of unequal intervals, there result differences in the intervals comprising the pattern. The ladder at the right displays the scale. From J. S. Bach, The Well-Tempered Clavier, Book 1, Fugue V. From Deutsch (1977).

Figure 11 Transposition along the alphabet of the A♭-minor triad, shown as log frequency against time. The same pattern is presented five times in succession, at different positions along this triad. Because the triad consists of uneven intervals, there result differences in the intervals comprising the pattern. The ladder at the right displays the triad. From F. Schubert, Four Impromptus, Op. 90, No. IV.


Further evidence comes from findings that melodies were better remembered when they were composed only of tones in a particular diatonic set than when they also contained tones outside the set (Cuddy et al., 1981; Dowling, 1991; Francès, 1958/1988). Presumably, adhering to a diatonic set increases the likelihood that the listener would invoke a key, and so use overlearned pitch alphabets as an aid to memory. It has also been reported that altering the context of a melody so as to suggest a different key rendered the melody more difficult to recognize (Dowling, 1986). Yet other studies have found that transposed melodies that did not involve a change in key were judged as very similar to the original ones, regardless of whether or not the intervals were preserved (Bartlett & Dowling, 1980; Dewitt & Crowder, 1986; Dowling, 1978, 1986; Takeuchi & Hulse, 1992; Van Egmond & Povel, 1994b). In addition, an alteration in a melody has been found easier to detect when it could be interpreted as a departure from its key and so as departing from the alphabets appropriate to the key (Francès, 1958/1988; Dewar, Cuddy, & Mewhort, 1977; Dowling, 1978).

D. The Deutsch/Feroe Model


The model proposed by Deutsch and Feroe (1981) (hereafter termed D&F) describes how pitch sequences in tonal music are encoded and represented in memory. Music theorists have argued that Western tonal music is composed of segments that are organized in hierarchical fashion (Lerdahl & Jackendoff, 1983; Meyer, 1956, 1973; Narmour, 1990, 1992; Schenker, 1956), and it is reasonable to suppose that this form of organization reflects the ways in which musical information is encoded and retained. As Greeno and Simon (1974) point out, we appear to retain many different types of information as hierarchies. We also appear to retain hierarchies of rules (Scandura, 1970), of programs (Miller, Galanter, & Pribram, 1960), and of goals in problem solving (Ernst & Newell, 1969). Visual scenes appear to be retained as hierarchies of subscenes (Palmer, 2002). The phrase structure of a sentence lends itself readily to hierarchical interpretations (Miller & Chomsky, 1963). Restle (1970) and Restle and Brown (1970) have provided evidence that we readily acquire serial patterns as hierarchies that reflect the structure of these patterns. Parallel theoretical developments by Simon and his colleagues (Simon, 1972; Simon & Kotovsky, 1963; Simon & Sumner, 1968) and by others (Jones, 1978; Leeuwenberg, 1971; Vitz & Todd, 1969) have addressed the ways in which we acquire and retain serial patterns in terms of hierarchies of operators.

The D&F model is in the coded-element tradition, but it differs fundamentally from others in its basic architecture. The structural units of the model are sequences that are organized in accordance with universal grouping principles, such as proximity and good continuation. Structural units can also be based on schemata that have been acquired through exposure to the music of a particular tradition. These structural units combine to form a hierarchical network, in which elements that are present at any given level are elaborated by further elements so as to form structural units at the next-lower level, until the lowest level is reached. It should be


Figure 12 A series of pitches represented on two hierarchical levels. (a) At the higher level, there is an arpeggiation of the C-major triad. (b) At the lower level, each note of the triad is preceded by one a semitone lower, so forming a two-note pattern (B-C, D♯-E, F♯-G, B-C). (c) The hierarchical structure as a tree diagram. Adapted from Deutsch and Feroe (1981). ©1981 by the American Psychological Association. Adapted with permission.

emphasized that although the model focuses on Western tonal music of the common practice era, it can equally well be applied to the music of other periods and cultures, and it assumes only that, through long-term exposure to music in a given style, listeners have become familiar with the pitch alphabets of the music in that style.

The model is introduced by a musical example. The pitch sequence shown in Figure 12b can, in principle, be represented in terms of steps along the chromatic scale: A basic subsequence consisting of a step up this scale is presented four times in succession, the second instantiation being four steps up from the first, the third being three steps up from the second, and the fourth being five steps up from the third. This analysis assigns prominence to the basic subsequence and does not relate its different instantiations in a meaningful way. A musical analysis of this pattern would instead describe it in terms of the two structural levels shown in Figures 12a and 12b. The basic relationship expressed here is that of the elaboration of a higher-level subsequence by a lower-level subsequence. The higher level, shown in Figure 12a, consists of an arpeggiation that ascends through the C-major triad (C-E-G-C). At the lower level, each note of the triad is preceded by a neighbor embellishment, so that the two-note patterns (B-C), (D♯-E), (F♯-G), (B-C) are formed. Figure 12c represents this hierarchical structure in tree form.


Specifically, a simplified version of the D&F model is as follows:

1. A structure is notated as (A1, A2, . . ., Al-2, Al-1, *, Al+1, Al+2, . . ., An), where Aj is one of the operators n, p, s, ni, or pi. The asterisk (*) provides a reference point for the other operators, and appears exactly once in the structure.
2. Each structure (A1, A2, . . ., *, . . ., An) has associated with it an alphabet, α. The combination of a structure and an alphabet is called a sequence (or subsequence). This, together with the reference element r, produces a sequence of notes.
3. The effect of each operator in a structure is determined by that of the operator closest to it, on the same side as the asterisk. The operator n refers to traversing one step up the alphabet associated with the structure. The operator p refers to traversing one step down this alphabet. The operator s refers to remaining in the same position. The two operators ni and pi refer to traversing up or down i steps along the alphabet, respectively.
4. The values of the sequence of notes (A1, A2, . . ., *, . . ., An), α, r, where α is the alphabet and r the reference element, are obtained by taking the value of the asterisk to be that of r.
5. To produce another sequence from the two sequences A = (A1, A2, . . ., *, . . ., Am), α, and B = (B1, B2, . . ., *, . . ., Bn), β, where α and β are two alphabets, we define the compound operator pr (prime). A[pr]B; r, where r is the reference element, refers to assigning values to the notes produced from (B1, B2, . . ., *, . . ., Bn) such that the value of the asterisk is the same as the value of A1, when the sequence A is applied to the reference element r. Values are then assigned to the notes produced from (B1, B2, . . ., *, . . ., Bn) such that the value of the asterisk is the same as the value of A2, and so on. This gives a sequence of length m × n. Other compound operators such as inv (inversion) and ret (retrograde) are analogously defined.

So according to the formalism just outlined, the pattern shown in Figure 12 can be represented as:

A = {(*, 3n); Ctr}
B = {(p, *); Cr}
S = {A[pr]B; C4}

where Ctr represents the C-major triad, Cr the chromatic scale, and C4 the reference element. In other words, sequence A consists of a reference point followed by three successive steps along the C-major triad. Sequence B represents an ascending half-step that ends on a reference point. To combine these two sequences so as to produce the full sequence, the reference element C4 replaces the reference point in sequence A; this produces the sequence of notes (C4-E4-G4-C5). The sequence B is then applied to each note of sequence A, taking each note of sequence A as the reference point. This produces the entire sequence of notes (B3-C4-D♯4-E4-F♯4-G4-B4-C5). In many other hierarchical representations of music, such as those proposed by Schenker (1956) and the coded-element models referred to earlier, elements that are present at all but the lowest level are rule systems rather than actual notes.
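As a way of making the formalism concrete, the following Python sketch implements one simplified reading of these rules. Alphabets are represented as ascending lists of MIDI note numbers, structures as lists of operator tokens, and the prime operator as the nesting of one realized sequence inside another; the names and data representations are assumptions introduced for illustration, not part of the D&F model itself.

```python
# A runnable sketch of a simplified reading of the D&F rules above.
# Alphabets are ascending lists of MIDI note numbers; a structure is a list of
# tokens such as '*', 'n', 'p', 's', 'n3', 'p2' (so ['*','n','n','n'] = (*, 3n)).

def build_alphabet(pitch_classes, low=36, high=96):
    """All MIDI notes in [low, high] whose pitch class is in pitch_classes."""
    return [m for m in range(low, high + 1) if m % 12 in pitch_classes]

def shift(alphabet, note, steps):
    """Move `steps` positions along the alphabet from `note` (which must be in it)."""
    return alphabet[alphabet.index(note) + steps]

def apply_token(alphabet, anchor, token):
    if token == 's':
        return anchor
    sign = 1 if token[0] == 'n' else -1
    size = int(token[1:]) if len(token) > 1 else 1
    return shift(alphabet, anchor, sign * size)

def realize(structure, alphabet, ref):
    """Assign a note to every token; the asterisk takes the reference value.
    Every other token is computed from its neighbor on the asterisk's side."""
    i = structure.index('*')
    notes = [None] * len(structure)
    notes[i] = ref
    for j in range(i + 1, len(structure)):            # tokens right of the asterisk
        notes[j] = apply_token(alphabet, notes[j - 1], structure[j])
    for j in range(i - 1, -1, -1):                    # tokens left of the asterisk
        notes[j] = apply_token(alphabet, notes[j + 1], structure[j])
    return notes

def prime(seq_a, seq_b, ref):
    """One reading of A[pr]B: realize A on ref, then realize B on each note of A."""
    struct_a, alpha_a = seq_a
    struct_b, alpha_b = seq_b
    result = []
    for note in realize(struct_a, alpha_a, ref):
        result.extend(realize(struct_b, alpha_b, note))
    return result

C_MAJOR_TRIAD = build_alphabet({0, 4, 7})      # C, E, G
CHROMATIC = build_alphabet(set(range(12)))

A = (['*', 'n', 'n', 'n'], C_MAJOR_TRIAD)      # arpeggiation C4-E4-G4-C5
B = (['p', '*'], CHROMATIC)                    # each note preceded by its lower neighbor

print(prime(A, B, 60))   # [59, 60, 63, 64, 66, 67, 71, 72] = B3 C4 D#4 E4 F#4 G4 B4 C5
```

Note that, as in the model, the notes of the higher-level subsequence reappear in the full realization, since each lower-level elaboration ends on its reference note.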


In contrast, in the D&F model, an actual sequence of notes is realized at each structural level. This confers the advantage that notes that are present at any given level are also present at all levels below it. In consequence, the higher the level at which a note is represented, the more often and so the more firmly it is represented. This has the consequence that higher-level subsequences serve to cement lower-level subsequences together. As a further advantage, by repeatedly invoking the same structure, the model enables long sequences to be encoded in parsimonious fashion, essentially acting as a compression algorithm. A related processing advantage is that the model enables subsequences at different structural levels to be encoded as chunks of a few items each; this in turn is conducive to optimal memory performance (Anderson, Reder, & Lebiere, 1996; Estes, 1972; Wickelgren, 1967). As another processing advantage, the D&F model enables the encoding of subsequences in terms of laws of figural goodness, such as proximity and good continuation, and also enables the invocation of melodic schemata and archetypes in the representation of subsequences. This has the effect of binding the tones within subsequences together, and so also helps the listener to apprehend and remember the full sequence. As yet a further advantage, the model enables different pitch alphabets to be invoked at different hierarchical levels. The use of multiple alphabets here has the benefit of helping to clarify and disambiguate the different levels of the hierarchy.

Experimental evidence indicates that listeners process pitch sequences in accordance with the D&F model when given the opportunity to do so. One hypothesis that arises from the model is that a sequence of notes should be processed more easily when it can be parsimoniously represented in accordance with its rules. In an experiment to test this hypothesis, Deutsch (1980) presented musically trained listeners with sequences of notes, which they recalled in musical notation. Examples of these sequences are shown in Figure 13. The passage in Figure 13a (a structured sequence) consists of a higher-level subsequence of four elements that acts

Figure 13 Examples of sequences used in the experiment to study utilization of pitch structure in recall. Sequence (a) can be represented parsimoniously as a higher-level subsequence of four elements (an arpeggiation of the G-major triad) that acts on a lower-level subsequence of three elements (a step down and then up the chromatic scale). Sequence (b) consists of a haphazard reordering of the notes in sequence (a) and cannot be parsimoniously represented. Adapted from Deutsch (1980).


Figure 14 Types of temporal structure used in the experiment to study the utilization of pitch structure in recall. (a) Sequence unsegmented. (b) Sequence segmented in groups of three, so that segmentation is in accordance with pitch structure. (c) Sequence segmented in groups of four, so that segmentation is in conflict with pitch structure.

on a lower-level subsequence of three elements. The passage in Figure 13b (an unstructured sequence) consists of a haphazard reordering of the passage in Figure 13a, and does not lend itself to a parsimonious representation. It was predicted, on the basis of the model, that the structured sequences would be notated more accurately than the unstructured ones. Another factor was also examined in this experiment. It has been found in studies using strings of verbal materials that we tend to recall such strings in accordance with their temporal grouping (Bower & Winzenz, 1969; McLean & Gregg, 1967; Mueller & Schumann, 1894). This effect was found to be so powerful as to offset grouping by meaning (Bower & Springston, 1970). Analogous results have also been obtained using nonverbal materials (Dowling, 1973; Handel, 1973; Restle, 1972). It was predicted, therefore, that temporal grouping would affect ease of recall of the present tonal sequences in analogous fashion. In particular, temporal grouping in accordance with pitch structure was expected to enhance performance, whereas grouping in conflict with pitch structure was expected to result in performance decrements. See London (2012) for an excellent discussion of the effects of timing on perception of pitch structures. Given these considerations, sequences such as these were presented in three temporal configurations (Figure 14). In the first, the tones were spaced at equal intervals; in the second, they were spaced in four groups of three, so that they were segmented in accordance with pitch structure; in the third, they were spaced in three groups of four, so that they were segmented in conflict with pitch structure. Large effects of both pitch structure and temporal segmentation were obtained. For structured sequences that were segmented in accordance with pitch structure, performance levels were very high. For structured sequences that were unsegmented, performance levels were still very high, though slightly lower. However, for structured sequences that were segmented in conflict with pitch structure, performance levels were much lower. For unstructured sequences,


performance levels were considerably lower than for structured sequences that were segmented in accordance with their structure or that were unsegmented; instead, they were in the same range as for structured sequences that were segmented in conflict with pitch structure.

Figure 15 shows the percentages of tones that were correctly recalled in their correct serial positions in the different conditions of the experiment. Typical bow-shaped curves are apparent, and in addition, discontinuities occur at the boundaries between temporal groupings. This pattern of results indicates that the subjects encoded the temporal groupings as chunks, which were retained or lost independently of each other. This pattern is very similar to that found by others with the use of verbal materials (Bower & Winzenz, 1969). The transition shift probability (TSP) provides a further measure of interitem association. This is defined as the joint probability of either an error following a correct response on the previous item, or of a correct response following an error on the previous item (Bower & Springston, 1970); a small computational sketch of this measure appears at the end of this section. If groups of elements tend to be retained or lost as chunks, we should expect the TSP values to be smaller for transitions within a chunk, and larger for the transition into the first element of a chunk. It was indeed found that TSPs were larger on the first element of each temporal grouping than on other elements. This is as expected on the hypothesis that temporal groupings serve to define subjective chunks that are retained or lost independently of each other.

In general, the findings of Deutsch (1980) provide strong evidence that listeners perceive hierarchical structures that are present in tonal sequences, and that they use such structures in recall. For the structured sequences used here, the listener needed only to retain two chunks of three or four items each; however, for the unstructured sequences, no such parsimonious encoding was possible. The error rates for the unstructured sequences were much higher than for the structured sequences, in accordance with the hypothesis that they imposed a much heavier memory load.

Another study was carried out by Van Egmond and Povel (1996). A paired comparison paradigm was employed to investigate perceived similarities between melodies and their transpositions, when the latter had been altered in various ways. The D&F model was used as a qualitative predictor of the degree of perceived similarity between the original and transposed melodies. The authors hypothesized that the larger the number of items by which the codes for the original and transposed melodies differed, the more dissimilar the two melodies would appear. More specifically, Van Egmond and Povel predicted that an exact transposition would be judged as most similar to the original melody, because its code would differ only in terms of one item; i.e., the key. For a transposition that was chromatically altered, the prediction concerning perceived similarity would depend on whether the transposed melody could be represented parsimoniously in the same key as the original. If it could be so represented, then its code would differ in terms of only one item, the reference element. If it could not be so represented, then its code would differ in terms of two items, the key and the reference element. Finally, a transposition that was diatonically altered would be judged as most


Figure 15 Serial position curves for the different conditions of the experiment to study the utilization of pitch structure in recall. The percentage of tones recalled is plotted against serial position. 3: Temporal segmentation in groups of three. 4: Temporal segmentation in groups of four. 0: No temporal segmentation. S: Structured sequence. U: Unstructured sequence. From Deutsch (1980).


dissimilar to the original melody, because its code would differ in terms of six items: the key and five structure operators.

The experimental findings confirmed the hypothesis. Exact transpositions were judged to be most similar to the original melodies. Chromatically altered transpositions that could be interpreted as in the same key as the original melodies were judged to be more similar than were those that could not be so interpreted. Transpositions that were diatonically altered were judged to be more dissimilar than were chromatically altered transpositions.

In a further set of experiments, Hamaoui and Deutsch (2010) constructed two groups of sequences. Those in one group could be parsimoniously represented in hierarchical fashion according to the D&F rules. Those in the other group were unstructured, but they matched the structured sequences in terms of starting pitch, number of changes in pitch direction, overall pitch movement, and interval size content. The effect of grouping by hierarchical structure, as measured by the duration of conflicting temporal gaps required to overrule it, was found to be remarkably strong.

In yet another study, Oura (1991) presented subjects with a melody, which they recalled in musical notation. Tones that were represented at higher structural levels were recalled better than were those that were represented at lower levels. Further, Dibben (1994) had subjects listen to a musical segment, and then to a pair of reductions, and they judged which reduction best matched the full segment. She found that the subjects chose the version that matched the full segment at higher structural levels. The findings from both these studies are in accordance with the prediction from the D&F model, that the higher in a tonal-temporal hierarchy a note or sequence of notes is represented, the more often it is represented, and so the more firmly it should be embedded in memory (see also Wang & Sogin, 1990).

So far we have been considering the processing of a single melodic line. However, tonal music generally involves several such lines, and even where only one is presented, a harmonic progression is generally implied. We can assume that such progressions are also encoded in hierarchical fashion. In addition, the use of parallel linear sequences, which must also combine to form an acceptable harmonic sequence, places constraints on the choice of elements in each sequence; this in turn serves to reduce the processing load.
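Returning to the transition shift probability measure described earlier in this section, the following minimal sketch shows one way such a measure can be computed from per-trial recall scores. The toy data and the function name are illustrative assumptions, not Deutsch's (1980) materials or analysis code.

```python
# A minimal sketch of the transition shift probability (TSP): for each serial
# position, the proportion of trials on which the correct/incorrect status
# changed relative to the preceding position.  The toy data are illustrative.

def transition_shift_probabilities(trials):
    """trials: list of per-trial 0/1 recall scores (1 = recalled correctly)."""
    n_positions = len(trials[0])
    tsp = []
    for pos in range(1, n_positions):
        shifts = sum(1 for trial in trials if trial[pos] != trial[pos - 1])
        tsp.append(shifts / len(trials))
    return tsp

# Two chunks of three items, each retained or lost as a unit: TSP is low within
# chunks and high at the chunk boundary (the transition into the second chunk).
trials = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
]
print(transition_shift_probabilities(trials))   # [0.0, 0.0, 0.75, 0.0, 0.0]
```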

E. Acquisition of a Representation
We now consider how the D&F model addresses the process by which the listener acquires a representation of a passage. The model assumes that an initial set of subsequences is formed on the basis of simple organizational principles, such as proximity and good continuation. We can also assume that the listener's choice of a dominant note in a subsequence, which then serves as a reference point, is also initially guided by low-level factors, such as an increase in loudness or duration, metrical stress, and the temporal position of the note in the subsequence.


We can consider, as an example, the sequence in Figure 16, which was derived from Figure 1 of Deutsch and Feroe (1981). This pattern can be described as an arpeggiation that ascends through the C-major triad (E-G-C), with each note of the triad preceded by a neighbor embellishment. In other words, the notes E, G, and C are targeted for representation at a higher level, as shown in the associated tree diagram. As a result, the key of C major is clearly attributed, even though two of the notes in the sequence (D♯ and F♯) are outside the C-major scale. However, when the identical sequence of notes is played in reverse order, as shown in Figure 17, it is no longer heard as in C major, but instead as in E minor. We target the notes B, F♯, and D♯ so as to form the subsequence (B-F♯-D♯) at the next-higher level, as shown in the associated tree diagram. So we hear an arpeggiation that descends through the B-major triad, and we hear it as implying the dominant of E minor, leading us to attribute the key of E minor instead.

Deutsch (1984) suggested that this paradoxical outcome is based on the following process: Using primitive organizational principles, the listener forms low-level groupings from the two-note patterns that are related by proximity, and then assigns prominence to the second note of each two-note pattern. This leads to the assignment of the subsequence (E-G-C) at the higher level when the sequence is played forward, and (B-F♯-D♯) when the sequence is played backward.
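A minimal sketch of this proposed bottom-up process, assuming only the pairing of successive notes by proximity and the targeting of the second note of each pair, is given below; the representation, octave placement, and names are mine, not Deutsch's.

```python
# Illustrative sketch: group successive tones into two-note patterns and take
# the second (displacing) note of each pair as the higher-level tone.
# Pitches are MIDI note numbers (middle C = 60); names are for printing only.

NAMES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def higher_level(notes):
    """Keep the second note of each successive two-note grouping."""
    return [notes[i + 1] for i in range(0, len(notes) - 1, 2)]

figure_16 = [63, 64, 66, 67, 71, 72]         # D#4 E4, F#4 G4, B4 C5 (octaves assumed)

forward = higher_level(figure_16)            # targets E4, G4, C5: C-major arpeggiation
backward = higher_level(figure_16[::-1])     # targets B4, F#4, D#4: B-major arpeggiation

print([NAMES[n % 12] for n in forward])      # ['E', 'G', 'C']
print([NAMES[n % 12] for n in backward])     # ['B', 'F#', 'D#']
```

The same grouping rule applied to the same pitch collection thus yields different higher-level subsequences, and hence different implied keys, depending only on temporal order.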

Figure 16 Pitch sequence to illustrate the effect of a particular temporal ordering on a given collection of tones. This sequence is heard as in C major although two tones are outside the C-major scale. The tree diagram illustrates the hypothesized mental representation of this sequence. Adapted from Deutsch (1984).


Figure 17 The identical pitch sequence as in Figure 16, but presented in reverse order. The tree diagram illustrates the hypothesized mental representation of this sequence, which is heard as in E minor. Adapted from Deutsch (1984).



As another example of the strong influence of ordering, we can consider the passages shown in Figure 13. Passage (a) (G-F♯-G-D-C♯-D-B-A♯-B-G-F♯-G) clearly invokes the key of G major, even though two of the notes (C♯ and A♯) are outside the G-major scale. Again, the listener forms low-level groupings based on pitch proximity (G-F♯-G, and so on), and targets the notes (G-D-B-G) to form a subsequence at the next-higher level. However, when the same set of notes is played in haphazard order, as in Passage (b), the listener cannot form a parsimonious hierarchical representation of the passage, so the key becomes ambiguous. So the D&F model and the associated experiments clarify that (1) key assignments can be readily made for passages that include tones outside the scale for the assigned key, (2) they are strongly dependent on the ordering of the notes in the passage, and (3) listeners can use simple organizational principles based on ordering to create a hierarchical structure from these notes, and so to assign a key to the passage.

Kim (2011) has addressed the important question of why the listener chooses the second of each pair of notes in the examples in Figures 16 and 17 as the dominant note. He pointed out that melodic steps have been proposed by several music theorists to have an inhibitory effect. For example, Komar (1971) described the second note of a linear pair as the stepwise displacement of the first note. Further, Larson (1997) observed that this concept relies on the distinction between steps and leaps: In a melodic step (defined as a distance of one or two semitones), the second note tends to displace the trace of the first note in memory, so that it becomes the more prominent note. Kim proposed, therefore, that resulting from stepwise displacement, the listener perceives the second note of each two-note grouping as more prominent, and so targets this note for representation at a higher structural level.

Bharucha (1984a, 1984b) has advanced the alternative proposal that the listener needs to assign a key in order to generate a hierarchy of prominence of notes within a passage. In other words, he proposed that the decision as to which notes assume prominence is driven by the internalized knowledge of hierarchies of prominence within a key (see also Krumhansl, 1990). In contrast, Kim (2011), while

Figure 18 Passage with a clear hierarchical structure independent of key. The higher-level subsequence consists of a descending chromatic scale, and the lower-level subsequences are all diminished triads. See text for details. From Prelude VI in D minor, by J. S. Bach.


acknowledging that top-down processing is also invoked, including making reference to an established key, contended that bottom-up processes are heavily involved in establishing hierarchies of prominence. In this context, we can observe that the D&F model does not require that listeners first attribute a key in order to acquire a hierarchical representation of a passage. The passage in Figure 18, taken from Bach's Prelude in D Minor, consists of a higher-level subsequence that traverses the chromatic scale from B5 down to D5. Each note in this subsequence is elaborated by an arpeggiation that descends through the diminished triad. The full sequence so produced can be notated as:

A = {(9n, *); Cr}
B = {(*, 2p); dimtr}
S = {A[pr]B; D5}

where Cr indicates the chromatic alphabet, and dimtr indicates the diminished triad. The sequence ends on the note D (the tonic) but could in principle have ended on any note in the chromatic set. So rather than relying on an established key, these hierarchical representations play a large role in the process of key identification itself, through an elaborate bootstrapping operation in which different cues feed back on each other.
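The same point can be made computationally: the notes of the Figure 18 structure can be generated from nothing more than a chromatic descent and a fixed elaboration pattern, with no reference to a key. The following sketch is illustrative only, and assumes MIDI numbering with middle C as C4 (MIDI 60).

```python
# Illustrative sketch of the Figure 18 structure: a chromatic descent from B5
# to D5, with each note elaborated by a descending diminished-triad arpeggiation
# (two further notes, each a minor third below the previous one).

NAMES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def name(midi):
    return NAMES[midi % 12] + str(midi // 12 - 1)

B5, D5 = 83, 74
higher_level = list(range(B5, D5 - 1, -1))                       # B5, Bb5, ..., D5
passage = [top - k for top in higher_level for k in (0, 3, 6)]   # diminished arpeggios

print([name(m) for m in higher_level])
print([name(m) for m in passage][:6])    # ['B5', 'G#5', 'F5', 'A#5', 'G5', 'E5'] ...
```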

F. Other Approaches to Key Identification


A number of approaches to key identification have been taken, and these fall into several categories (see also Temperley, Chapter 8). One approach holds that listeners possess a template that represents the distribution of pitch classes for each of the 12 major and minor keys. When a piece is heard, its pitch class distribution is compared with that in each of the templates, and the template that provides the best match wins. This view assumes that the ordering of the different pitch classes in a piece is ignored, with only the statistical distribution of the pitch classes remaining. An early model in this category was proposed by Longuet-Higgins and Steedman (1971). As a first pass, the model assumes that as a note is sounded, it eliminates all keys whose scales do not contain that note. This process continues until only one candidate key remains. A problem with this model is that it cannot account for correct key identifications of passages containing notes that are outside the scale for that key, as in the examples given in Figures 12 and 13a. Krumhansl and Schmuckler (1986; see also Krumhansl, 1990) proposed a distributional model based on a set of key profiles, which were derived from a study by Krumhansl and Kessler (1982) (hereafter termed K&K). To generate the profiles, musically trained subjects were presented with a musical context (a scale, chord, or cadence) that was followed by a probe tone, and they judged how well the probe tone fit in the context provided. Probe-tone ratings were obtained for all 12 pitch classes in each context. The ratings from the different keys and contexts


were then averaged so as to generate a single major-key profile and a single minor-key profile.

The procedure used to generate the K&K profiles has been criticized on a number of grounds. In particular, in averaging across the contexts provided to the subjects, taking the major and minor keys separately (the procedure used by K&K), one obtains distributions of the number of repetitions of each pitch class that correspond remarkably well to the profiles obtained from the subjects' rating judgments (Butler, 1989). The profiles could, therefore, simply reflect enhancement by repetition in short-term memory. Later, Huron and Parncutt (1993) and Leman (2000) produced models that simulated K&K's probe-tone data, but were based on short-term memory effects. Further, Deutsch (1970a, 1972a, 1975a) observed that repetition of the pitch of a tone in an atonal setting resulted in memory enhancement for that tone; these findings produced direct evidence that a probe tone should be heard as more salient as a function of its repeated presentation in the context provided (see Section IV). In other work, Oram and Cuddy (1995) and Creel and Newport (2002) carried out probe-tone studies employing context melodies that were generated from artificial pitch class distributions designed to be very dissimilar to those in major or minor scales. The subjects' judgments correlated with the pitch class distributions in the context melodies, so that those pitches that occurred more often in the context were given higher ratings, findings that are again attributable to repetition effects in short-term memory. In sum, since probe tone ratings are strongly influenced by short-term contexts, they cannot be assumed by default to reflect long-term exposure to music of the listener's tradition.

Another argument has been advanced by Temperley and Marvin (2008), Aarden (2003), and Huron (2006), based on statistical analyses of large samples of Western tonal music. These authors found that although the K&K profiles correlated with the distributions of pitch classes within keys, the correlations were imperfect, and for certain scale degrees there were substantial discrepancies between the profiles and the actual distributions.

At all events, the Krumhansl and Schmuckler algorithm adds information about note duration to the K&K profiles, and then determines the key of a passage (or piece) by comparing its pitch class distribution with the amended K&K profiles for each of the 12 major and minor keys and choosing the one with best fit. Other models based on the distribution of pitch classes in a passage or a piece have been proposed by Chew (2002), Vos and Van Geenen (1996), Yoshino and Abe (2004), and Temperley (2007).

The distributional approach to key finding has been criticized on the grounds that it neglects the effect of temporal ordering of the pitches in a passage. Several alternative approaches that emphasize temporal ordering have been proposed. Most prominently, Butler, Brown, and colleagues (Brown, 1988; Brown & Butler, 1981; Brown, Butler, & Jones, 1994; Browne, 1981; Butler, 1989; Butler & Brown, 1984; Van Egmond & Butler, 1997) have contended that key identification is strongly influenced by the presence of rare intervals within a key; in particular, minor seconds and tritones. Their work has focused on the tritone, which in the major scale occurs only between two scale degrees (4 and 7). Even considering the tritone,


ordering is important: for example, F-B implies the key of C whereas B-F implies the key of F♯. Vos (1999) also emphasized the importance of certain melodic intervals for key identification. Specifically, he proposed that a rising fifth or a descending fourth at the beginning of a melody provides important cues. In addition, Callender, Quinn, and Tymoczko (2008) have proposed a substantial model of voice leading that emphasizes the ordering of chord progressions.

Evidence for the strong influence of ordering was provided in the study by Deutsch (1980) discussed earlier. It was shown that a set of pitches that were ordered in such a way that they could be encoded parsimoniously as phrases in tonal music were easily processed, whereas the same set of pitches reordered haphazardly was processed only poorly (Figure 13). Also, as described in Deutsch (1984), the sequence shown in Figures 16 and 17 can be heard either as in C major or as in E minor, depending on whether it is played forward or backward. Further experimental evidence for the importance of ordering and intervallic information was obtained by Brown (1988). Subjects were presented with pitch class sets that were ordered either to evoke a particular tonal center, or to evoke a different tonal center, or to be tonally ambiguous. The subjects' key judgments were strongly influenced by these manipulations (see also Brown et al., 1994). Matsunaga and Abe (2005) also found that subjects' choices of tonal centers for passages were influenced by the orderings of the presented tones. In another experiment, West and Fryer (1990) presented subjects with quasi-random orderings of the tones in a diatonic scale, in each case followed by a probe tone, and the subjects judged the suitability of the probe tone as a tonic in the context of the sequence they had just heard. It was found that the actual tonic was not judged as uniquely suitable as the tonal center; instead, scale degrees 1, 3, 4, and 5 were rated as equally suitable.

Smith and Schmuckler (2004) created sequences in which the K&K profiles (or variants of these) were used to create distributions of the durations and frequencies of occurrence of the different pitch classes, which were then randomly ordered. Subjects were presented with these sequences, and they produced probe-tone profiles that were used by the authors to draw inferences concerning perceptions of key for these sequences. The tone profiles that the subjects produced were found to be similar to the original K&K profiles from which the sequences were derived. The authors interpreted this result to reflect the subjects' use of long-term knowledge of pitch class distributions within keys in making their judgments. However, since very similar distributional contexts were employed to generate both the original K&K profiles and the profiles obtained in their experiment, the results could instead have reflected the similarity of these two short-term contexts, rather than reflecting the use of long-term mental templates. Based in part on this reasoning, Temperley and Marvin (2008) argued that, rather than drawing inferences from probe tone responses, which are equivocal in their interpretation, a better procedure would be to have subjects identify the key of a passage explicitly. They also argued that subjects' judgments should be compared against pitch class distributions that are found in actual music, because probe-tone profiles correlate only imperfectly with these distributions.
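For concreteness, a distributional key finder of the kind discussed above can be sketched in a few lines: build a duration-weighted pitch-class distribution for a passage and correlate it with a profile rotated to each of the 24 keys. The profile values below are the commonly cited Krumhansl and Kessler (1982) ratings and should be checked against the original source; the function names and toy example are assumptions introduced for illustration, not the actual Krumhansl-Schmuckler implementation.

```python
# A minimal sketch of a distributional (Krumhansl-Schmuckler-style) key finder:
# correlate the passage's duration-weighted pitch-class distribution with a
# profile rotated to each of the 24 major and minor keys.

MAJOR_PROFILE = [6.35, 2.23, 3.48, 2.33, 4.38, 4.09, 2.52, 5.19, 2.39, 3.66, 2.29, 2.88]
MINOR_PROFILE = [6.33, 2.68, 3.52, 5.38, 2.60, 3.53, 2.54, 4.75, 3.98, 2.69, 3.34, 3.17]
NAMES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def correlate(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def estimate_key(notes):
    """notes: iterable of (pitch_class, duration) pairs.  Returns (r, tonic, mode)."""
    dist = [0.0] * 12
    for pc, dur in notes:
        dist[pc % 12] += dur
    candidates = []
    for tonic in range(12):
        for mode, profile in (('major', MAJOR_PROFILE), ('minor', MINOR_PROFILE)):
            rotated = profile[-tonic:] + profile[:-tonic]   # profile transposed to this tonic
            candidates.append((correlate(dist, rotated), NAMES[tonic], mode))
    return max(candidates)

# Toy example: an ascending C-major scale with equal note durations.
print(estimate_key([(pc, 1.0) for pc in (0, 2, 4, 5, 7, 9, 11)]))
```

As the surrounding discussion emphasizes, such a procedure is blind to the order in which the pitch classes occur, which is precisely the limitation that the ordering-based accounts address.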


Reasoning along these lines, Temperley and Marvin presented subjects with melodies that were generated quasi-randomly from scale-degree distributions. The distributions were created from the first eight measures of each of the string quartet movements by Haydn and Mozart. The authors then created a profile displaying the proportion of events of each scale degree for each passage. The profiles from all major-key passages were averaged to create a major-key profile, and the analogous procedure was used to create a minor-key profile. The profiles were then employed to generate scale degrees in a stochastic fashion, so as to produce the presented melodies. The subjects, who were musically trained, listened to each passage, and then made explicit key judgments by locating the tonic on a keyboard, a task that is easy for musically trained listeners to accomplish. It was found that only slightly more than half of the subjects' judgments of the presented melodies matched the generating key. In a further analysis, the authors determined for each melody the key that was chosen by the largest number of subjects, and they found that judgments of this key accounted for only 56.1% of the key judgments, showing that the subjects disagreed among themselves substantially in their choice of key. From these findings, Temperley and Marvin concluded that listeners perform poorly in determining the key of a melody when it is generated from pitch class distributions alone, and that structural cues produced by the ordering of the tones in the sequence must also be employed in the process of key identification.
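The stimulus-construction logic described here can be sketched as follows. The profile values are purely illustrative placeholders, not the Haydn- and Mozart-derived profiles that Temperley and Marvin actually used, and the sketch ignores octave, rhythm, and duration.

```python
# Illustrative sketch: sample scale degrees stochastically in proportion to a
# scale-degree profile, then map them into a key as pitch classes.

import random

MAJOR_SCALE = [0, 2, 4, 5, 7, 9, 11]                               # semitones above the tonic
ILLUSTRATIVE_PROFILE = [0.22, 0.10, 0.16, 0.10, 0.20, 0.12, 0.10]  # weights for degrees 1-7

def random_melody(tonic_pc, n_notes, profile=ILLUSTRATIVE_PROFILE, seed=None):
    """Return n_notes pitch classes drawn from the profile, in the given key."""
    rng = random.Random(seed)
    degrees = rng.choices(range(7), weights=profile, k=n_notes)
    return [(tonic_pc + MAJOR_SCALE[d]) % 12 for d in degrees]

print(random_melody(tonic_pc=7, n_notes=12, seed=1))   # a melody "generated in" G major
```

Because the sampling is order-blind, any tonal ordering cues in such melodies arise only by chance, which is what makes them a useful test of purely distributional key identification.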

IV. The Organization of Short-Term Memory for Tones

We here present a theoretical framework for the representation of tones in shortterm memory (otherwise known as working memory). This issue is fundamental to our understanding of music perception and cognition, because tones form the basic units from which musical structures are derived. Indeed, as we have argued, certain characteristics of higher-level tonal organization can be attributed to interactions between tones at this basic level. It is evident from general considerations that memory for music must be the function of a heterogeneous system, whose various subdivisions differ in the persistence with which they retain information. For example, the system that subserves memory for pitch relationships must be capable of retaining information over very long periods of time, whereas this is not true of the system that retains absolute pitch values. Similarly, the system that retains temporal patterns must preserve information for considerably longer than the system that retains absolute values of duration. Based on such considerations, we can assume that when memory for a musical pattern is tested after various time periods have elapsed, differences in its form of encoding would emerge. More specifically, the model assumes that musical tones are initially subjected to a set of perceptual analyses, which are carried out in different subdivisions of the auditory system. Such analyses result in the attribution of values of pitch,


loudness, duration, and so on, as well as values resulting from higher-level analyses, such as intervals, chords, rhythms, and timbres. It is further assumed that in many of these subsystems, information is represented along arrays that are systematically organized with respect to a simple dimension, such as pitch, loudness, or duration, or some higher-level dimension such as interval size, or in a multidimensional space, such as timbre. The model further assumes that the outputs of these analyses are projected onto arrays in corresponding subdivisions of the auditory memory system. So, for example, one subdivision retains values of pitch, and others retain values of duration, loudness, interval size, timbre, and so on. Information is retained in parallel in these different subdivisions; however, the time constants of retention in these subdivisions vary considerably. It is further assumed that specific interactions take place within these subdivisions that are analogous to those that occur in systems processing auditory information at the incoming level. The outputs of these different subdivisions then combine during retrieval of information from memory.

Neurophysiological findings support the hypothesis of multiple auditory memory stores that subserve different stimulus attributes. When a listener is presented with a series of identical tones followed by a new tone, the new tone elicits an event-related brain potential called the mismatch negativity or MMN, which is assumed to reflect the detection of a difference between the incoming stimulus and the stimuli that have been stored in memory. Giard et al. (1995) analyzed the MMNs elicited by pure tones that deviated from standard tones in frequency, intensity, or duration. They found that the scalp topographies of the MMNs varied according to type of stimulus deviance, and they concluded that the frequency, intensity, and duration of a sound have separate neural representations in memory. In addition, MMNs obtained from tones that differed in terms of two features have been found to be roughly equal to the sum of the MMNs obtained from tones that differed in terms of a single feature, indicating that the standard tones leave multiple representations in the brain (Levänen, Hari, McEvoy, & Sams, 1993; Schröger, 1995).

Within this framework of multiple parallel stores, we first focus on memory for pitch, and examine how values of this attribute are represented in storage and how they are accessed during retrieval. We then consider how other attributes of tone are represented in memory.

A. The System That Retains Absolute Pitch Values


In considering the characteristics of the system that retains absolute pitch values, a number of hypotheses may be advanced. For example, such memory might simply deteriorate with the passage of time. Another possibility is that pitch information is retained in a general store that is limited in terms of the number of items it can retain, so that memory loss results from a general information overload. As a third possibility, memory for pitch might be the function of an organized system whose elements interact in specific ways.


We can begin with the following observations. When a tone is presented, and this is followed immediately by another tone that is either identical in pitch to the first or that differs by a semitone, most listeners find it very easy to determine whether the two tones are the same or different in pitch. The task continues to be very easy when a silent interval of 6 s intervenes between the tones to be compared. Although memory for pitch has been shown to fade gradually with the passage of time (Bachem, 1954; Clément, Demany, & Semal, 1999; Harris, 1952; Kaernbach & Schlemmer, 2008; Rakowski, 1994; Wickelgren, 1966, 1969), the amount of fading during a silent retention interval of 6 s is so small that it is barely apparent in this situation. However, when eight extra tones intervene during the retention interval, the task becomes strikingly difficult, and this is true even when the listener is instructed to ignore the intervening tones. Deutsch (1970b) found that listeners who made no errors in comparing such tone pairs when they were separated by 6 s of silence made 40% errors when eight tones intervened during the retention interval. In a companion experiment, either four,