DOI: 10.1523/JNEUROSCI.0938-17.2017
Author contributions: K.C.P. and J.Z.S. designed research; K.C.P. and J.Z.S. performed research; K.C.P. and
J.Z.S. analyzed data; K.C.P. and J.Z.S. wrote the paper.
The authors thank Elizabeth Nguyen for excellent MEG technical support, and Stuart Rosen for advice on the
stimulus paradigm. This work was supported by NIH grant R01 DC014085.
Accepted manuscripts are peer-reviewed but have not been through the copyediting, formatting, or proofreading
process.
Department of Electrical & Computer Engineering, University of Maryland,

Corresponding author: Jonathan Z. Simon (jzsimon@umd.edu), Department of

Abbreviated title: Foreground and Background Speech Representations
Number of pages: 45
Number of figures: 5
Abstract

The ability to parse a complex auditory scene into perceptual objects is facilitated
magnetoencephalography (MEG) recordings from human subjects, both men and women, we
the entire auditory scene. Here, both attended and ignored speech streams are represented with almost equal fidelity, and a global representation of the full auditory scene with all its streams is a better candidate neural representation than
that higher order auditory cortical areas represent the attended stream separately, and with significantly higher fidelity, than unattended streams. Furthermore, the
Significance Statement:

stages of auditory cortex. We show that the primary-like areas in auditory cortex
scene, with both attended and ignored speech streams represented with almost equal fidelity. In contrast, we show that higher order auditory cortical areas represent an attended speech stream separately from, and with significantly higher
Introduction

mix linearly and irreversibly before they enter the ear, yet are perceived as distinct objects by the listener (Cherry, 1953; Bregman, 1994; McDermott, 2009). The
performs this task with ease. The neural mechanisms by which this perceptual un-
scene and its constituents, and the role of attention in both, are key problems in
and Johnsrude, 2003; Hickok and Poeppel, 2007; Rauschecker and Scott, 2009; Okada et al., 2010; Peelle et al., 2010; Overath et al., 2015), with subcortical areas projecting onto the core areas of auditory cortex, and from there, on to belt, parabelt, and additional auditory areas (Kaas and Hackett, 2000). Sound entering
different latencies (Recanzone et al., 2000; Nourski et al., 2014). Due to this serial
both anatomy and latency, of which the latter may be exploited using the high temporal fidelity of non-invasive magnetoencephalography (MEG) neural recordings.
In selective listening experiments using natural speech and MEG, the two major neural responses known to track the speech envelope are the M50TRF and M100TRF, with respective latencies of 30–80 ms and 80–150 ms, of which the dominant neural sources are, respectively, Heschl's gyrus (HG) and planum temporale (PT) (Steinschneider et al., 2011; Ding and Simon, 2012b). Posteromedial HG is the site of core auditory cortex; PT contains both belt and
(Griffiths and Warren, 2002; Sweet et al., 2005). Hence the earlier neural responses are dominated by core auditory cortex, and the later ones by higher-order areas. To better understand the neural mechanisms of auditory scene
auditory scene change from the core to the higher order auditory areas.
One topic of interest is whether the brain maintains distinct neural representations for each unattended source (in addition to the representation of the
single monolithic background object. A common paradigm used to investigate the neural mechanisms underlying auditory scene analysis employs a pair of speech streams, of which one is attended, leaving the other speech stream as the background (Kerlin et al., 2010; Ding and Simon, 2012b; Mesgarani and Chang, 2012; Power et al., 2012; Zion Golumbic et al., 2013b; O'Sullivan et al., 2015). This design is limited in that it cannot address the question of distinct vs. collective neural representations for unattended sources. This touches on the long-standing debate of whether auditory object segregation is
al., 2005; Shinn-Cunningham, 2008; Shamma et al., 2011). Evidence for
former, whereas a lack of segregated background objects would support the latter.
two major hypotheses: that the dominant representation in core auditory cortex is of the physical acoustics, not of separated auditory objects; and that once object-based representations emerge in higher order auditory areas, the unattended
background object. The methodological approach employs the linear systems methods of stimulus prediction and MEG response reconstruction (Lalor et al., 2009; Mesgarani et al., 2009; Ding and Simon, 2012b; Mesgarani and Chang,
Materials & Methods:

Subjects & Experimental Design. Nine normal-hearing, young adults (6 female) participated in the experiment. All subjects were paid for their participation. The
segments spoken by, respectively, a male adult, a female adult, and a child speaker. The three speech segments were mixed into a single audio channel at equal perceptual loudness. All three speech segments were taken from public domain
than 300 ms were replaced by a shorter gap whose duration was chosen randomly between 200 ms and 300 ms. The audio signal was low-pass filtered with a cut-off at 4 kHz. In the first of three conditions, the subjects were asked to attend to the child speaker while ignoring the other two (i.e., child speaker as target, with male and female adult speakers as background). In condition two, during which the same mixture was played as in condition one, the subjects were instead asked to attend to the male adult speaker (with female adult and child speakers as background). Similarly, in condition three, the target was switched to the female adult speaker. Each condition was repeated three times successively, producing three trials per condition. The presentation order of the three conditions was counterbalanced
across subjects. Each trial was 220 s in duration, divided into two 110 s partitions to reduce listener fatigue. To help participants attend to the correct speaker, the first 30 s of each partition was replaced by the clean recording of the target speaker alone, followed by a 5 s upward linear ramp of the background speakers. Recordings of this first 35 s of each segment were not included in any analysis. To further encourage the subjects to attend to the correct speaker, a target word was set before each trial and the subjects were asked to count the number of occurrences of the target word in the speech of the attended speaker. Additionally, after each condition, the subject was asked to give a short summary of the attended narrative. The subjects were required to close their eyes while listening. Before the main experiment, 100 repetitions of a 500-Hz tone pip were presented to each subject to elicit the M100 response, a reliable auditory response occurring ~100 ms after the onset of a tone pip. These data were used to check whether any potential subjects gave abnormal auditory responses, but no subjects were
Data recording and pre-processing. MEG recordings were conducted using a 160-
Japan). Its detection coils are arranged in a uniform array on a helmet-shaped surface at the bottom of the dewar, with ~25 mm between the centers of two adjacent 15.5-mm-diameter coils. Sensors are configured as first-order axial gradiometers with a baseline of 50 mm; their field sensitivities are 5 fT/√Hz or better in the white-noise region. Subjects lay horizontally in a dimly lit
recorded at a sampling rate of 1 kHz with an online 200-Hz low-pass filter and a 60-Hz notch filter. Three reference magnetic sensors and three vibrational sensors were used to measure the environmental magnetic field and vibrations. The reference sensor recordings were utilized to reduce environmental noise in the MEG recordings using the Time-Shift PCA method (de Cheveigné and Simon, 2007). Additionally, the MEG recordings were decomposed into virtual sensors/components using denoising source separation (DSS) (Särelä and Valpola, 2005; de Cheveigné and Simon, 2008; de Cheveigné and Parra, 2014), a blind source separation method that enhances neural activity that is consistent over trials. Specifically, DSS decomposes the multichannel MEG recording into temporally uncorrelated
reliability, measured by the correlation between the responses to the same stimulus in different trials. To reduce computational complexity, for all further analysis the 157 MEG sensors were reduced, using DSS, to 4 components in each hemisphere. Also, both the stimulus envelope and the MEG responses were band-pass filtered between 1 and 8 Hz (delta and theta bands), which correspond to the slow
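The DSS step described above can be illustrated with a minimal sketch on synthetic data: find the linear combination of channels whose trial-averaged (stimulus-locked) power is maximal relative to its total power, via a generalized eigendecomposition. This shows the principle only and is not the authors' exact pipeline; channel counts, signal, and noise here are invented stand-ins.

```python
# Sketch of a DSS-style decomposition: components are ranked by trial-to-trial
# reliability, here computed as generalized eigenvectors of (covariance of the
# trial average) vs. (total covariance). Synthetic data, illustrative only.
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(5)
n_trials, n_ch, n_t = 20, 8, 500
source = np.sin(2 * np.pi * 4 * np.arange(n_t) / 100)    # reliable evoked signal
mix = rng.standard_normal(n_ch)                          # its sensor topography

# Each trial: the same evoked source plus independent sensor noise
trials = np.stack([np.outer(mix, source) + rng.standard_normal((n_ch, n_t))
                   for _ in range(n_trials)])

C_raw = np.mean([tr @ tr.T for tr in trials], axis=0)    # total covariance
avg = trials.mean(axis=0)
C_avg = avg @ avg.T                                      # covariance of trial average

# Generalized eigenvectors maximize (reliable power) / (total power)
_, vecs = eigh(C_avg, C_raw)
dss1 = vecs[:, -1]                                       # most reliable component
reliability = abs(np.corrcoef(dss1 @ avg, source)[0, 1])
print(round(reliability, 2))
```

The most reliable components would then stand in for the raw sensors in subsequent analyses, as the paper does with 4 components per hemisphere.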
description, in each condition the subject attends to one of the three speech
contrasting the predicted envelope reconstructions of the different models. The envelope of the attended speech stream is referred to as the foreground, and the envelope of each of the two unattended speech streams is referred to as an individual background. In contrast, the envelope of the entire unattended part of the stimulus, comprising both unattended speech streams, is referred to as the
scene, comprising all three speech streams, is referred to as the acoustic
acoustic scene. In contrast, the sum of the envelopes of the three speech streams,
two are not mathematically equal: even though both are functions of the same stimuli, they differ due to the non-linear nature of a signal envelope (the linear correlation between the acoustic scene and the sum of streams is typically ~0.75).
Combination (unsegregated) envelopes, whether of the entire acoustic scene or of the background only, can be used to test neural models that do not perform stream segregation. Sums of individual stream envelopes, whether of all streams or just the background streams, can be used to test neural models that process the (segregated) streams in parallel, given that neurally generated magnetic fields add
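The inequality between the envelope of a mixture and the sum of the individual envelopes can be seen in a toy example. The paper derives envelopes from an auditory spectrogram; a plain Hilbert envelope stands in here, and the signals are synthetic amplitude-modulated tones rather than speech:

```python
# Toy demonstration that the envelope is non-linear: env(s1 + s2) != env(s1) + env(s2).
# Hilbert envelopes of synthetic AM tones stand in for the paper's auditory-
# spectrogram envelopes of speech.
import numpy as np
from scipy.signal import hilbert

fs = 1000
t = np.arange(0, 2.0, 1 / fs)
# Two amplitude-modulated "streams" (hypothetical stand-ins for speech)
s1 = (1 + 0.8 * np.sin(2 * np.pi * 3 * t)) * np.sin(2 * np.pi * 220 * t)
s2 = (1 + 0.8 * np.sin(2 * np.pi * 5 * t)) * np.sin(2 * np.pi * 330 * t)

env = lambda x: np.abs(hilbert(x))        # broadband envelope
env_of_mix = env(s1 + s2)                 # "acoustic scene" envelope
sum_of_envs = env(s1) + env(s2)           # "sum of streams" envelope

r = np.corrcoef(env_of_mix, sum_of_envs)[0, 1]
print(round(r, 2))                        # correlated, but clearly below 1
```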
Neural responses with latencies less than ~85 ms (typically originating from core auditory areas) are referred to here as early neural responses, and responses with latencies more than ~85 ms (typically from higher-order auditory areas) (Ahveninen et al., 2011; Okamoto et al., 2011; Steinschneider et al., 2011)
The next two sections describe models of the neural encoding of stimuli into responses, followed by models of the decoding of stimuli from neural responses. Encoding models are presented first because they are easier to describe than decoding models, but in Results the decoding analysis is presented first, since it is the decoding results that inform the new model of encoding.
Temporal Response Function. In an auditory scene with a single talker, the relation between the MEG neural response and the presented speech stimulus can be modeled as

r(t) = \sum_{\tau} \mathrm{TRF}(\tau) \, s(t - \tau) + \varepsilon(t)    (1)

where t is time, r(t) is the response from any individual sensor or DSS component, s(t) is the stimulus envelope in decibels, TRF(\tau) is the TRF itself, and \varepsilon(t) is the residual response waveform not explained by the TRF model (Ding and Simon, 2012a). The envelope is extracted by averaging the auditory spectrogram (Chi et al., 2005) along the spectral dimension. The TRF is estimated using boosting with 10-fold cross-validation (David et al., 2007). In the case of a single speech stimulus, the TRF is typically characterized by a positive peak between 30 ms and 80 ms and a negative peak between 90 ms and 130 ms, referred to as M50TRF and M100TRF respectively (Ding and Simon, 2012b) (positivity/negativity of the magnetic field is by convention defined to agree with the corresponding
evaluated by how well it predicts neural responses, as measured by the proportion of the variance explained: the square of the Pearson correlation coefficient between the MEG measurement and the TRF model prediction.
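The forward model of Eq. 1 and its evaluation can be sketched in a few lines. The paper estimates the TRF with boosting; plain ridge regression stands in here as a simpler estimator, and the envelope, sampling rate, and TRF shape are synthetic stand-ins:

```python
# Sketch of the TRF encoding model (Eq. 1): predict r(t) as a lagged linear
# filtering of the stimulus envelope s(t); score by squared Pearson correlation.
import numpy as np

rng = np.random.default_rng(0)
fs, n = 100, 2000                         # assumed analysis rate (Hz), samples
s = rng.standard_normal(n)                # stand-in stimulus envelope
lags = np.arange(0, 25)                   # 0-240 ms of lags at 100 Hz
true_trf = np.exp(-lags / 8.0) * np.sin(2 * np.pi * lags / 12.0)

# Lagged design matrix X[t, j] = s(t - lag_j)
X = np.column_stack([np.roll(s, L) for L in lags])
X[:lags[-1], :] = 0                       # zero out wrap-around samples
r = X @ true_trf + 0.5 * rng.standard_normal(n)   # synthetic response

lam = 1.0                                 # ridge regularization (stand-in for boosting)
trf_hat = np.linalg.solve(X.T @ X + lam * np.eye(len(lags)), X.T @ r)
r2 = np.corrcoef(X @ trf_hat, r)[0, 1] ** 2       # proportion of variance explained
print(round(r2, 2))
```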
In the case of more than one speaker, the MEG neural response can be modeled as the sum of the responses to the individual acoustic sources (Ding and Simon, 2012b; Zion Golumbic et al., 2013b), referred to here as the 'Summation model'. For example, with three speech streams, the neural response would be modeled as

r(t) = \sum_{i=1}^{3} \sum_{\tau=0}^{T} \mathrm{TRF}_i(\tau) \, s_i(t - \tau) + \varepsilon(t)    (2)

where s_1(t), s_2(t), and s_3(t) are the envelopes of the three speech streams, and T represents the length of the TRF. All TRFs in the Summation model are estimated simultaneously.
model referred to as the Early-late model, which allows one to incorporate the hypothesis that the early neural responses typically represent the entire acoustic scene, but that the later neural responses differentially represent the separated

r(t) = \sum_{\tau=0}^{\tau_{\mathrm{split}}} \mathrm{TRF}_{\mathrm{scene}}(\tau) \, s_{\mathrm{scene}}(t - \tau) + \sum_{\tau=\tau_{\mathrm{split}}}^{T} \left[ \mathrm{TRF}_{\mathrm{fg}}(\tau) \, s_{\mathrm{fg}}(t - \tau) + \mathrm{TRF}_{\mathrm{bg}}(\tau) \, s_{\mathrm{bg}}(t - \tau) \right] + \varepsilon(t)    (3)

where s_{fg}(t) is the envelope of the attended (foreground) speech stream, s_{bg}(t) is the combined background (i.e., the envelope of everything other than the attended speech stream in the auditory scene), and
boundary values of the integration windows for early and late neural responses
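The structure of the Early-late model can be sketched as a single regression whose predictors carry different lag windows: the full-scene envelope for the early window and the foreground/background envelopes for the late window, fit jointly. The 85 ms split follows the paper; the envelopes, rates, and weights below are illustrative stand-ins:

```python
# Sketch of the Early-late encoding model (Eq. 3): early lags driven by the
# unsegregated scene, late lags by foreground and background, fit by least squares.
import numpy as np

rng = np.random.default_rng(1)
fs, n = 100, 3000
fg, bg = rng.standard_normal(n), rng.standard_normal(n)   # stand-in envelopes
scene = fg + bg                           # unsegregated scene (illustrative)

def lagged(x, lags):
    """Design matrix X[t, j] = x(t - lag_j), with wrap-around samples zeroed."""
    X = np.column_stack([np.roll(x, L) for L in lags])
    X[:max(lags), :] = 0
    return X

split = int(0.085 * fs)                   # 85 ms early/late boundary
X = np.hstack([lagged(scene, range(0, split)),    # early window: scene only
               lagged(fg, range(split, 30)),      # late window: foreground
               lagged(bg, range(split, 30))])     # late window: background

# Synthetic response generated from the Early-late structure, plus noise
w_true = rng.standard_normal(X.shape[1]) * 0.3
r = X @ w_true + rng.standard_normal(n)

w_hat, *_ = np.linalg.lstsq(X, r, rcond=None)
r2 = np.corrcoef(X @ w_hat, r)[0, 1] ** 2
print(round(r2, 2))
```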
The explanatory power of different models, such as the Summation and Early-late models, can be ranked by comparing the accuracy of their response predictions (illustrated in Figure 1A). The models differ in their number of free parameters, with the Early-late model having fewer parameters than the Summation model. Hence, any improved performance observed in the proposed Early-late model over the Summation model is correspondingly more likely due to a better model fit, since it has less freedom to fit the data (though the converse
Decoding speech from neural responses. While the TRF/encoding analysis described in the previous section predicts the neural response from the stimulus, decoding analysis reconstructs the stimulus from the neural response. Thus, decoding analysis complements the TRF analysis (Mesgarani et al., 2009).
as

\hat{s}(t) = \sum_{k} \sum_{\tau=t_1}^{t_2} r_k(t + \tau) \, h_k(\tau)    (4)

where \hat{s}(t) is the reconstructed envelope, r_k(t) is the MEG recording (neural
sensor/component k. The times t_1 and t_2 denote the beginning and end times of
envelope reconstructions using neural responses from any desired time window can be compared. The decoder is estimated using boosting, analogously to the TRF estimation in the previous section. In the single-talker case the envelope is that of
the envelope of the speech of the attended talker, or of one of the background talkers, or of a mixture of any two or all three talkers, depending on the model under consideration. Chance-level reconstruction (i.e., the noise floor) from a particular
from that neural response. Figure 2 illustrates the distinction between reconstruction of the stimulus envelope from early and late responses. The stimulus envelope at time point t can be reconstructed using neural responses from the dashed (early response) window or the dotted (late response) window. (While it is true that the late responses to the stimulus at time point t − Δt overlap with early responses to the stimulus at time point t, the decoder used to reconstruct the stimulus at time point t from early responses is only minimally affected by late responses to the stimulus at time point t − Δt when the decoder is estimated by averaging over a long enough duration, e.g., tens of seconds). The cut-off time between early and late responses was chosen to minimize the overlap between the M50TRF and M100TRF peaks, on a per-subject basis, with a median
the single value of 85 ms for all subjects did not qualitatively change any conclusions. When decoding from early responses only, the time window of
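The backward model of Eq. 4 can be sketched analogously to the forward model, now regressing the envelope on lagged multichannel responses. Channel count, delays, and lag window below are illustrative stand-ins, not the paper's MEG pipeline:

```python
# Sketch of the decoding model (Eq. 4): reconstruct the stimulus envelope from
# anti-causal lags of multichannel responses, via least squares.
import numpy as np

rng = np.random.default_rng(2)
fs, n, n_ch = 100, 3000, 4               # e.g., 4 DSS components, as in the paper
env = rng.standard_normal(n)             # stand-in stimulus envelope

# Synthetic responses: each channel is a delayed, noisy copy of the envelope
delays = [5, 8, 12, 15]                  # samples (50-150 ms at 100 Hz)
R = np.stack([np.roll(env, d) + 0.5 * rng.standard_normal(n) for d in delays])

# Design matrix of anti-causal lags r_k(t + tau), tau in [0, 200 ms)
lags = np.arange(0, 20)
X = np.column_stack([np.roll(R[k], -L) for k in range(n_ch) for L in lags])
X[-lags[-1]:, :] = 0                     # zero wrap-around samples at the end

h, *_ = np.linalg.lstsq(X, env, rcond=None)   # decoder weights h_k(tau)
acc = np.corrcoef(X @ h, env)[0, 1]           # reconstruction accuracy
print(round(acc, 2))
```

Restricting `lags` to the early or late window only corresponds to the early-response vs. late-response reconstructions compared in Results.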
Statistics. All statistical comparisons reported here are two-tailed permutation tests with N = 1,000,000 random permutations (within subject). Due to the value of N selected, the smallest accurate p value that can be reported is 2·(1/N) (= 2×10⁻⁶; the factor of 2 arises from the two-tailed test), and any p value smaller than 2/N is reported as p < 2×10⁻⁶. The statistical comparison between foreground and individual backgrounds requires special mention, since each listening condition has one foreground but two individual backgrounds. From the perspective of both behavior and task, the two individual backgrounds are interchangeable. Hence,
the average reconstruction accuracy of the two individual backgrounds is used.
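A within-subject permutation test of this kind can be sketched as follows, here as a paired sign-flip test on hypothetical accuracy values (the data are invented, and N is kept small for speed; the paper uses N = 1,000,000):

```python
# Sketch of a two-tailed paired permutation test: randomly flip the sign of each
# paired difference and compare the observed mean against the null distribution.
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical paired reconstruction accuracies (e.g., foreground vs. background)
fg = np.array([0.30, 0.28, 0.35, 0.31, 0.27, 0.33, 0.29, 0.32, 0.30])
bg = np.array([0.22, 0.25, 0.24, 0.23, 0.26, 0.21, 0.25, 0.22, 0.24])

diff = fg - bg
obs = diff.mean()
N = 10000
signs = rng.choice([-1, 1], size=(N, diff.size))
null = (signs * diff).mean(axis=1)
# Two-tailed p value, with +1 correction so p is never exactly 0
p = (np.sum(np.abs(null) >= abs(obs)) + 1) / (N + 1)
print(p < 0.05)
```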
Finally, Bayes factor analysis is used, when appropriate, to evaluate evidence in favor of the null hypothesis, since conventional hypothesis testing is not suitable for such purposes. Briefly, Bayes factor analysis calculates the posterior odds, i.e., the

\frac{P(H_0 \mid D)}{P(H_1 \mid D)} = \frac{P(D \mid H_0)}{P(D \mid H_1)} \cdot \frac{P(H_0)}{P(H_1)}    (5)

\mathrm{BF}_{01} = \frac{P(D \mid H_0)}{P(D \mid H_1)}    (6)

factor, BF01. Then, under the assumption of equal priors (P(H0) = P(H1)), the posterior odds reduce to BF01. A BF01 value of 10 indicates that the data are ten times more likely to occur under the null hypothesis than the alternative hypothesis; conversely, a BF01 value of 0.1 indicates that the data are 10 times more likely to occur under the alternative hypothesis than the null hypothesis. Conventionally, a BF01 value between 3 and 10 is considered moderate evidence in favor of the null hypothesis, and a value between 10 and 30 strong evidence; conversely, a BF01 value between 1/3 and 1/10 (respectively 1/10 and 1/30) is considered moderate (respectively strong) evidence for the alternative hypothesis (for more details we refer the reader to Rouder et al. (2009)).
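The flavor of a BF01 computation can be illustrated with the BIC approximation to the Bayes factor (Wagenmakers, 2007), BF01 ≈ exp((BIC1 − BIC0)/2). This is a generic stand-in, not necessarily the computation used in the paper (which follows Rouder et al., 2009); the data below are synthetic and centered exactly on zero so that the null is true by construction:

```python
# Sketch: BIC-approximated Bayes factor for H0 (mean = 0) vs. H1 (mean free),
# with Gaussian likelihoods evaluated at the maximum-likelihood estimates.
import numpy as np

rng = np.random.default_rng(4)
x = rng.standard_normal(40) * 0.1
x = x - x.mean()                         # center exactly on zero (H0 true, illustrative)
n = x.size

def gauss_loglik(data, mu):
    s2 = np.mean((data - mu) ** 2)       # MLE of the variance given the mean mu
    return -0.5 * n * (np.log(2 * np.pi * s2) + 1)

bic0 = -2 * gauss_loglik(x, 0.0) + 1 * np.log(n)        # H0: 1 free parameter
bic1 = -2 * gauss_loglik(x, x.mean()) + 2 * np.log(n)   # H1: 2 free parameters
bf01 = np.exp((bic1 - bic0) / 2)         # BF01 > 1 favors the null
print(round(bf01, 2))                    # → 6.32 (= sqrt(40) for exactly-centered data)
```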
Results

To investigate the neural representations of the attended vs. unattended speech streams associated with early auditory areas, i.e., with core auditory cortex (Nourski et al., 2014), the temporal envelope of the attended (foreground) and
decoders optimized individually for each speech stream. All reconstructions performed significantly better than chance level (foreground vs. noise, p < 2×10⁻⁶; individual background vs. noise, p < 2×10⁻⁶), indicating that all three speech streams are represented in early auditory cortex. Figure 3A shows
bias for the attended speech stream over the ignored speech stream in early neural responses. In fact, Bayes factor analysis (BF01 = 4.2) indicates moderate support in favor of the null hypothesis (Rouder et al., 2009) that early neural responses do not distinguish significantly between attended and ignored speech streams.
To test the hypothesis that early auditory areas represent the auditory scene
the acoustic scene (the envelope of the sum of all three speech streams) and
reconstruction envelopes of each of the three individual speech streams). Separate decoders, optimized individually, were used to reconstruct the acoustic scene and the sum of streams. As can be seen in Figure 3B, the acoustic scene is better reconstructed than the sum of streams (p < 2×10⁻⁶). This indicates that early auditory cortex is better described as processing the entire acoustic scene rather than processing the separate elements of the scene individually.
While the preceding results were based on early cortical processing, the following results are based on late auditory cortical processing (responses with latencies greater than ~85 ms). Figure 4A shows the scatter plot of reconstruction accuracy for the foreground vs. individual background envelopes based on late responses. A paired permutation test shows that reconstruction accuracy for the foreground is significantly higher than for the background (p < 2×10⁻⁶). Even though the individual backgrounds are not as reliably reconstructed as the foreground, their reconstructions are nonetheless significantly better than chance level (p < 2×10⁻⁶).
entire background as a whole, with the reconstructability of the sum of the
auditory object (i.e., the background), the reconstruction of the envelope of the entire background should be more faithful than the sum of the envelopes of the individual
objects, each distinguished by its own envelope, the reconstruction of the sum of the envelopes of the individual backgrounds should be more faithful. Figure 4B shows the scatter plot of reconstruction accuracy for the envelope of the combined background vs. the sum of the envelopes of the individual background streams. Analysis shows that the envelope of the combined background is significantly better represented than the sum of the individual envelopes of the individual
background is actually strongly correlated with the sum of the envelopes of the
reconstruction accuracy is a priori unlikely, lending even more credence to the result.
The results above from envelope reconstruction suggest that while early neural responses represent the auditory scene in terms of the acoustics, the later neural responses represent the auditory scene in terms of a separated foreground and a single background stream. To further test this hypothesis, we use TRF-based encoding analysis to directly compare two different models of auditory scene representation. The two models compared are the standard Summation model (based on parallel representations of all speech streams; see Equation 2) and the new Early-late model (based on an early representation of the entire acoustic scene and late representations of the separated foreground and background; see Equation 3). Figure 5 shows the response prediction accuracies for the two models. A permutation test shows that the accuracy of the Early-late model is considerably higher than that of the Summation model (p < 2×10⁻⁶). This indicates that a model in which early/core auditory cortex processes the entire acoustic scene but later/higher-order auditory cortex processes the foreground and background separately has more support than the previously employed model of
Discussion

scenario, to investigate the neural representations of an auditory scene. From MEG recordings of subjects selectively attending to one of three co-located speech streams, we observed that 1) the early neural responses (from sources with short latencies), which originate primarily from core auditory cortex, represent the foreground (attended) and background (ignored) speech streams without any significant difference, whereas the late neural responses (from sources with longer latencies), which originate primarily from higher-order areas of auditory cortex, represent the foreground with significantly higher fidelity than the background; 2) early neural responses are not only balanced in how they represent the constituent speech streams, but in fact represent the entire acoustic scene holistically, rather than as separately contributing individual perceptual objects; 3) even though there are two physical speech streams in the background, no neural segregation is
anatomical areas at different latencies (Inui et al., 2006; Nourski et al., 2014). Using this idea to inform the neural decoding/encoding analysis allows the effective isolation of neural signals from a particular cortical area, and thereby the
though different response components are generated by different neural sources, standard neural source localization algorithms may perform poorly when different sources are strongly correlated in their responses (Lütkenhöner and Mosher, 2007). While the proposed method is not to be viewed as an alternative to source
Even though there is no significant difference between the ability to reconstruct the foreground and the background from early neural responses,
representation of the foreground (foreground > background, p = 0.21). This could
mammalian primary auditory cortex (Fritz et al., 2003), where the receptive fields of neurons are tuned to match the stimulus characteristics of attended sounds. The
auditory cortices) but not in early responses (from core auditory cortex) observed here using decoding is in agreement with the encoding result of Ding and Simon (2012b), where the authors showed that the late M100TRF component, but not the
increase in fidelity of the foreground as the response latency increases indicates a
representation for the attended speech stream has been demonstrated, albeit with only two speech streams and without differentiating between early and late responses, using delta and theta band neural responses (Ding and Simon, 2012b; Zion Golumbic et al., 2013a; Zion Golumbic et al., 2013b) as well as high-gamma neural responses (Mesgarani and Chang, 2012; Zion Golumbic et al., 2013a), and using monaural (Ding and Simon, 2012b; Mesgarani and Chang, 2012) as well as audio-visual speech (Zion Golumbic et al., 2013a; Zion Golumbic et al., 2013b).
Lakatos, 2009; Ng et al., 2012; Zion Golumbic et al., 2013b; Kayser et al., 2015) as the mechanism for selective tracking of attended speech, others suggest a temporal coherence model (Shamma et al., 2011; Ding and Simon, 2012b). Natural speech is quasi-rhythmic, with dominant rates at syllabic, word, and prosodic frequencies. The selective entrainment model suggests that attention causes endogenous low-frequency neural oscillations to align with the temporal structure of the attended speech stream, thus aligning the high-excitability phases of the oscillations with events in the attended stream. This effectively forms a mask that favors the attended speech. The temporal coherence model suggests that selective tracking of attended speech is achieved in two stages. First, a cortical filtering
This is followed by a second stage, coherence analysis, which combines relevant feature streams based on their temporal similarity, giving rise to separate perceptions of attended and ignored streams. In this model, it is hypothesized that attention, acting through the coherence analysis stage, plays an important role in stream formation. This type of coherence model predicts an unsegregated
than object-based (e.g., Figure 3B). This is further supported by the result that the Early-late model predicts MEG neural responses significantly better than the Summation model (e.g., Figure 5). This is consistent with previous demonstrations that neural activity in core auditory cortex was highly sensitive to
of sound (Nourski et al., 2009; Okada et al., 2010; Ding and Simon, 2013;
Steinschneider et al., 2014). In contrast, Nelken and Bar-Yosef (2008) suggest that neural auditory objects may form as early as primary auditory cortex, and Fritz et al. (2003) show that representations of dynamic sounds in primary auditory cortex are influenced by task. As a working principle, it is possible that less complex stimuli are resolved earlier in the hierarchy of the auditory pathway (e.g., sounds that can be separated via tonotopy), whereas more complex stimuli (e.g., concurrent speech streams), which need further processing, are resolved only much later in the auditory pathway. In addition, it is worth noting that the current study uses co-located speech streams, whereas mechanisms of stream segregation will also be influenced by other auditory cues, including spatial cues, differences in acoustic source statistics (e.g., only speech streams vs. mixed speech and music; strong statistical differences might drive stream segregation in a more bottom-up manner than the top-down attentional effects studied here), perceptual load effects (e.g., tone streams vs. speech streams), as well as visual cues. Any of these additional cues has the potential to alter the timing and neural mechanisms by which auditory
objects (Bregman, 1994; Griffiths and Warren, 2004; Shinn-Cunningham, 2008; Shamma et al., 2011). Ding and Simon (2012a) demonstrated evidence for an object-based cortical representation of an auditory scene, but did not distinguish between early and late neural responses. This, coupled with the result here that
strongly suggests that the object-based representation emerges only in the late neural responses/higher-order (belt and parabelt) auditory areas. This is further
representation, is observed in higher order areas but not in core auditory cortex (Chang et al., 2010; Okada et al., 2010). Given that the foreground is represented as an auditory object in late neural responses, the finding that the combined background
(Figure 4B) suggests that in late neural responses the background is not represented as separated and distinct auditory objects. This result is consistent with that of Sussman et al. (2005), who reported an unsegregated background when subjects attended to one of three tone streams in the auditory scene. This
Kersten, 2006; Poeppel et al., 2008) mechanism, wherein the auditory scene is first decomposed into basic acoustic elements, followed by top-down processes that guide the synthesis of the relevant components into a single stream, which then becomes the object of attention. The remainder of the auditory scene would be the unsegregated background, which itself might have the properties of an auditory object. When attention shifts, new auditory objects are correspondingly formed, with the old ones now contributing to the unstructured background. Shamma et al. (2011) suggest that this top-down influence acts through the principle of temporal coherence. Of the two opposing views, that streams are formed pre-attentively and that multiple streams can co-exist simultaneously, or that attention is required to form a stream and only that single stream is ever present as a separated perceptual entity, these findings lend support to the latter.
scene with multiple overlapping spectral and temporal sources, the core areas of auditory cortex maintain an acoustic representation of the auditory scene, with no significant preference for the attended over the ignored source and with no separation into distinct sources. It is only the higher-order auditory areas that provide an object-based representation of the foreground, but even there the background remains unsegregated.
References

Ahveninen J, Hämäläinen M, Jääskeläinen IP, Ahlfors SP, Huang S, Lin FH, Raij T, Sams M, Vasios CE, Belliveau JW (2011) Attention-driven auditory cortex short-term plasticity helps segregate relevant sounds from noise. Proc Natl Acad Sci U S A 108:4182-4187.
Bregman AS (1994) Auditory scene analysis: The perceptual organization of sound. MIT Press.
Carlyon RP (2004) How the brain separates sounds. Trends Cogn Sci 8:465-471.
Chang EF, Rieger JW, Johnson K, Berger MS, Barbaro NM, Knight RT (2010) Categorical speech representation in human superior temporal gyrus. Nat Neurosci 13:1428-1432.
Cherry EC (1953) Some experiments on the recognition of speech, with one and with two
David SV, Mesgarani N, Shamma SA (2007) Estimating sparse spectro-temporal
de Cheveigné A, Parra LC (2014) Joint decorrelation, a versatile tool for multichannel
Di Liberto GM, O'Sullivan JA, Lalor EC (2015) Low-frequency cortical entrainment to
Ding N, Simon JZ (2012a) Neural coding of continuous speech in auditory cortex during
Ding N, Simon JZ (2012b) Emergence of neural encoding of auditory objects while
Fritz J, Shamma S, Elhilali M, Klein D (2003) Rapid task-related plasticity of spectrotemporal receptive fields in primary auditory cortex. Nat Neurosci 6:1216-1223.
Griffiths TD, Warren JD (2002) The planum temporale as a computational hub. Trends
Griffiths TD, Warren JD (2004) What is an auditory object? Nature Reviews
Hickok G, Poeppel D (2007) The cortical organization of speech processing. Nat Rev
Inui K, Okamoto H, Miki K, Gunji A, Kakigi R (2006) Serial and parallel processing in the human auditory cortex: a magnetoencephalographic study. Cereb Cortex 16:18-30.
Kaas JH, Hackett TA (2000) Subdivisions of auditory cortex and processing streams in
Kayser C, Wilson C, Safaai H, Sakata S, Panzeri S (2015) Rhythmic auditory cortex activity at multiple timescales shapes stimulus-response gain and background firing.
Kerlin JR, Shahin AJ, Miller LM (2010) Attentional gain control of ongoing cortical
Lalor EC, Power AJ, Reilly RB, Foxe JJ (2009) Resolving precise temporal processing properties of the auditory system using continuous stimuli. J Neurophysiol 102:349-359.
Lütkenhöner B, Mosher JC (2007) Source analysis of auditory evoked potentials and fields. In: Auditory evoked potentials: basic principles and clinical application (Burkard RF, Eggermont JJ, Don M, eds).
McDermott JH (2009) The cocktail party problem. Curr Biol 19:R1024-1027.
Mesgarani N, David SV, Fritz JB, Shamma SA (2009) Influence of context and behavior
631 Nelken I, Bar-Yosef O (2008) Neurons and objects: the case of auditory cortex. Front
633 Ng BS, Schroeder T, Kayser C (2012) A precluding but not ensuring role of entrained
32
635 Nourski KV, Steinschneider M, McMurray B, Kovach CK, Oya H, Kawasaki H, Howard
636 MA, 3rd (2014) Functional organization of human auditory cortex: investigation of
638 Nourski KV, Reale RA, Oya H, Kawasaki H, Kovach CK, Chen H, Howard MA, 3rd,
641 O'Sullivan JA, Power AJ, Mesgarani N, Rajaram S, Foxe JJ, Shinn-Cunningham BG,
642 Slaney M, Shamma SA, Lalor EC (2015) Attentional Selection in a Cocktail Party
643 Environment Can Be Decoded from Single-Trial EEG. Cereb Cortex 25:1697-1706.
644 Okada K, Rong F, Venezia J, Matchin W, Hsieh IH, Saberi K, Serences JT, Hickok G
645 (2010) Hierarchical organization of human auditory cortex: evidence from acoustic
647 Okamoto H, Stracke H, Bermudez P, Pantev C (2011) Sound processing hierarchy within
649 Overath T, McDermott JH, Zarate JM, Poeppel D (2015) The cortical analysis of speech-
650 specific temporal structure revealed by responses to sound quilts. Nat Neurosci
651 18:903-911.
33
652 Pasley BN, David SV, Mesgarani N, Flinker A, Shamma SA, Crone NE, Knight RT,
653 Chang EF (2012) Reconstructing speech from human auditory cortex. PLoS Biol
654 10:e1001251.
655 Peelle JE, Johnsrude IS, Davis MH (2010) Hierarchical processing for speech in human
657 Poeppel D, Idsardi WJ, van Wassenhove V (2008) Speech perception at the interface of
658 neurobiology and linguistics. Philos Trans R Soc Lond B Biol Sci 363:1071-1086.
659 Power AJ, Foxe JJ, Forde EJ, Reilly RB, Lalor EC (2012) At what time is the cocktail
660 party? A late locus of selective attention to natural speech. Eur J Neurosci 35:1497-
661 1503.
662 Rauschecker JP, Scott SK (2009) Maps and streams in the auditory cortex: nonhuman
664 Recanzone GH, Guard DC, Phan ML (2000) Frequency and intensity response properties
665 of single neurons in the auditory cortex of the behaving macaque monkey. J
667 Rouder JN, Speckman PL, Sun D, Morey RD, Iverson G (2009) Bayesian t tests for
668 accepting and rejecting the null hypothesis. Psychon Bull Rev 16:225-237.
669 Srel J, Valpola H (2005) Denoising source separation. Journal of Machine Learning
34
671 Schroeder CE, Lakatos P (2009) Low-frequency neuronal oscillations as instruments of
673 Shamma SA, Elhilali M, Micheyl C (2011) Temporal coherence and attention in auditory
675 Shinn-Cunningham BG (2008) Object-based auditory and visual attention. Trends Cogn
678 their utility in the assessment of complex sound processing. In: The auditory cortex,
680 Steinschneider M, Nourski KV, Rhone AE, Kawasaki H, Oya H, Howard MA, 3rd
681 (2014) Differential activation of human core, non-core and auditory-related cortex
684 Sussman ES, Bregman AS, Wang WJ, Khan FJ (2005) Attentional modulation of
687 Sweet RA, Dorph-Petersen KA, Lewis DA (2005) Mapping auditory core, lateral belt,
688 and parabelt cortices in the human superior temporal gyrus. J Comp Neurol
689 491:270-289.
35
690 Yuille A, Kersten D (2006) Vision as Bayesian inference: analysis by synthesis? Trends
692 Zion Golumbic E, Cogan GB, Schroeder CE, Poeppel D (2013a) Visual input enhances
695 Zion Golumbic EM, Poeppel D, Schroeder CE (2012) Temporal context in speech
696 processing and attentional stream selection: a behavioral and neural perspective.
698 Zion Golumbic EM, Ding N, Bickel S, Lakatos P, Schevon CA, McKhann GM,
699 Goodman RR, Emerson R, Mehta AD, Simon JZ, Poeppel D, Schroeder CE (2013b)
702
36
Figure Legends:

Figure 1: Examples of model-based neural representations of the auditory scene and its constituents. All examples are grand averages across subjects (3 s duration). A. Comparing competing models of encoding to neural responses. In both the top and bottom examples, the actual MEG neural response is compared against neural response predictions made by competing proposed models. In the top example, the neural response prediction (red) is from the Early-late model; in the bottom example, the neural response prediction (magenta) is from the Summation model. The proposed Early-late model prediction shows a higher correlation with the actual MEG neural response than the Summation model prediction. B. Comparing competing models of decoding to stimulus speech envelopes. In both the top and bottom examples, an actual stimulus envelope is compared against the model reconstruction of the respective envelope (gray). In the top example, the envelope reconstruction (blue) is of the foreground stimulus, based on late time responses; in the bottom example, the envelope reconstruction (cyan) is of the background stimulus, also based on late time responses. The foreground reconstruction shows a higher correlation with the actual foreground envelope than the background reconstruction shows with the actual background envelope.
Figure 2: Early vs. late MEG neural responses to a continuous speech stimulus. A sample stimulus envelope and time-locked multi-channel MEG recordings are shown in red and black, respectively. The two grey vertical lines indicate two arbitrary time points, t − Δt and t. The dashed and dotted boxes represent the early and late MEG neural responses to the stimulus at time point t, respectively. The reconstruction of the stimulus envelope at time t can be based on either early or late neural responses, and the separate reconstructions can be compared against the actual stimulus envelope.
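The reconstruction scheme illustrated above can be sketched as time-lagged linear regression from multichannel responses onto the stimulus envelope, where "early" and "late" reconstructions simply use different response-lag windows. The following minimal sketch is not the authors' analysis code; the lag range, ridge parameter, and toy data are assumptions made for illustration:

```python
# Sketch: reconstruct a stimulus envelope at time t from a window of
# later neural responses, via time-lagged ridge regression.
import numpy as np

def lagged_design(response, lags):
    """Stack time-shifted copies of the response channels, one block per lag."""
    n_t, n_ch = response.shape
    X = np.zeros((n_t, n_ch * len(lags)))
    for i, lag in enumerate(lags):
        shifted = np.roll(response, -lag, axis=0)  # row t holds response at t+lag
        if lag > 0:
            shifted[-lag:] = 0.0                   # zero-pad instead of wrapping
        X[:, i * n_ch:(i + 1) * n_ch] = shifted
    return X

def ridge_decode(response, envelope, lags, alpha=1.0):
    """Fit w minimizing ||Xw - env||^2 + alpha*||w||^2; return reconstruction Xw."""
    X = lagged_design(response, list(lags))
    w = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ envelope)
    return X @ w

# Toy data: a 10-channel "MEG" response that is a 5-sample-delayed, noisy
# copy of the envelope, so a lag window covering the delay recovers it.
rng = np.random.default_rng(1)
env = rng.standard_normal(2000)
resp = np.stack([np.roll(env, 5)] * 10, axis=1) + 0.1 * rng.standard_normal((2000, 10))
recon = ridge_decode(resp, env, lags=range(0, 10))
r = np.corrcoef(env, recon)[0, 1]                  # reconstruction accuracy
```

An "early" vs. "late" comparison would run the same decoder twice with disjoint lag ranges (for instance roughly 0-85 ms vs. 85-500 ms; those boundaries are illustrative, not taken from the paper) and compare the two correlations.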
Figure 3: Stimulus envelope reconstruction accuracy using early neural responses. A. No significant preference for the foreground over the individual background streams is revealed in early neural responses. Each data point corresponds to a distinct background and condition partition per subject (with two backgrounds sharing a common foreground). B. Scatter plot of the reconstruction accuracy of the envelope of the entire acoustic scene vs. that of the sum of the envelopes of all three individual speech streams. The acoustic scene is reconstructed more accurately as a whole (visually, most of the data points fall above the diagonal) than as the sum of its individual components in early neural responses (p < 2 × 10⁻⁶). Each data point corresponds to a distinct condition partition per subject. Reconstruction accuracy is measured by the proportion of the variance explained: the square of the Pearson correlation coefficient between the actual and reconstructed envelopes.
Figure 4: Stimulus envelope reconstruction accuracy using late neural responses. A. The foreground speech is reconstructed with dramatically better fidelity (visually, most of the data points fall above the diagonal) than the background speech in late neural responses (p < 2 × 10⁻⁶). Each data point corresponds to a distinct background and condition partition per subject (with two backgrounds sharing a common foreground). B. Scatter plot of the reconstruction accuracy of the envelope of the entire background vs. that of the sum of the envelopes of the two individual background speech streams. The entire background is reconstructed more accurately than the sum of the individual backgrounds (p = 0.012). Each data point corresponds to a distinct condition partition per subject.
Figure 5: MEG response prediction accuracy. Scatter plot of the accuracy of the predicted MEG neural response for the proposed Early-late model vs. the standard Summation model. The Early-late model predicts the MEG neural response dramatically better (visually, most of the data points fall above the diagonal) than the Summation model (p < 2 × 10⁻⁶). The accuracy of the predicted MEG neural responses is measured by the proportion of the variance explained: the square of the Pearson correlation coefficient between the actual and predicted responses. Each data point corresponds to a distinct condition partition per subject.
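The accuracy metric used throughout Figures 3-5, the proportion of variance explained, is the squared Pearson correlation between an actual and a predicted (or reconstructed) signal. A minimal sketch (the signals below are synthetic stand-ins, not data from the study):

```python
# Proportion of variance explained as the squared Pearson correlation.
import numpy as np

def variance_explained(actual, predicted):
    """Squared Pearson correlation coefficient between two 1-D signals."""
    r = np.corrcoef(actual, predicted)[0, 1]
    return r ** 2

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)            # stand-in for an actual envelope
y = 0.5 * x + rng.standard_normal(1000)  # stand-in for a noisy reconstruction
ve = variance_explained(x, y)            # a value between 0 and 1
```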
Figure 1 [image: A, neural response prediction from stimulus; B, stimulus reconstruction from neural responses]
Figure 2 [image]
Figure 3 [image: stimulus reconstruction accuracy from early neural responses; A, foreground vs. individual background; B, acoustic scene vs. sum of streams]
Figure 4 [image: stimulus reconstruction accuracy from late neural responses; A, foreground vs. individual background; B, background vs. sum of individual backgrounds]
44
Neural Response Prediction
from Competing Models
0.2
Early-late model
0.15
0.1
0.05
0
0 0.05 0.1 0.15 0.2
783 Summation model
784
785 Figure 5
45