Abstract

How speech signals are analyzed and represented remains a foundational challenge both for cognitive science and neuroscience. A growing body of research, employing various behavioral and neurobiological experimental techniques, now points to the perceptual relevance of both phoneme-sized (10–40 Hz modulation frequency) and syllable-sized (2–10 Hz modulation frequency) units in speech processing. However, it is not clear how information associated with such different time scales interacts in a manner relevant for speech perception. We report behavioral experiments on speech intelligibility employing a stimulus that allows us to investigate how distinct temporal modulations in speech are treated separately and whether they are combined. We created sentences in which the slow (~4 Hz; S_low) and rapid (~33 Hz; S_high) modulations (corresponding to ~250 and ~30 ms, the average duration of syllables and certain phonetic properties, respectively) were selectively extracted. Although S_low and S_high have low intelligibility when presented separately, dichotic presentation of S_high with S_low results in supra-additive performance, suggesting a synergistic relationship between low- and high-modulation frequencies. A second experiment desynchronized presentation of the S_low and S_high signals. Desynchronizing the signals relative to one another had no impact on intelligibility when delays were less than ~45 ms. Longer delays resulted in a steep intelligibility decline, providing further evidence of integration or binding of information within restricted temporal windows. Our data suggest that human speech perception uses multi-time resolution processing. Signals are concurrently analyzed on at least two separate time scales, the intermediate representations of these analyses are integrated, and the resulting bound percept has significant consequences for speech intelligibility, a view compatible with recent insights from neuroscience implicating multi-timescale auditory processing.
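As a concrete illustration of the dichotic, desynchronized presentation described above, the following Python sketch combines two pre-filtered signals into a stereo stimulus with a controllable asynchrony. The function name, the 44.1 kHz sampling rate, and the zero-padding scheme are illustrative assumptions, not details taken from the study.

import numpy as np

def dichotic_stimulus(s_low, s_high, delay_ms=0.0, fs=44100):
    """Put s_low in the left channel and a delayed copy of s_high in the right channel."""
    delay = int(round(delay_ms * fs / 1000.0))          # asynchrony in samples
    s_high_delayed = np.concatenate([np.zeros(delay), s_high])
    n = max(len(s_low), len(s_high_delayed))            # common length for both ears
    left = np.pad(s_low, (0, n - len(s_low)))
    right = np.pad(s_high_delayed, (0, n - len(s_high_delayed)))
    return np.stack([left, right], axis=1)              # (n, 2) stereo, one signal per ear

# Example: stereo = dichotic_stimulus(s_low, s_high, delay_ms=45.0)  # near the ~45 ms tolerance limit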

Highlights


  • We demonstrate that presenting sentences containing only low-frequency modulations (S_low) together with S_high yields significantly better intelligibility than presenting either signal alone

  • Filter parameters were chosen to encompass the modulation frequencies shown to be most relevant for speech: 4 Hz (∼250-ms temporal windows) in the S_low condition and 33 Hz (∼30-ms temporal windows) in the S_high condition; a minimal filtering sketch follows this list
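A minimal sketch of how such modulation bands might be isolated is given below, assuming a Hilbert-transform envelope followed by Butterworth filtering. The exact signal processing used to construct the S_low and S_high stimuli is not specified here, so the function and the band edges (e.g., 25–40 Hz around 33 Hz) are illustrative assumptions.

import numpy as np
from scipy.signal import hilbert, butter, filtfilt

def modulation_band_envelope(x, fs, band):
    """Temporal envelope of x restricted to a modulation band (Hz); band=(0, hi) means low-pass."""
    env = np.abs(hilbert(x))                            # broadband temporal envelope
    lo, hi = band
    if lo <= 0:
        b, a = butter(4, hi / (fs / 2), btype="low")    # e.g., S_low: keep modulations below ~4 Hz
    else:
        b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")  # e.g., S_high: ~25-40 Hz
    return filtfilt(b, a, env)

# env_slow = modulation_band_envelope(speech, fs, (0, 4))    # syllable-scale (~250 ms) modulations
# env_high = modulation_band_envelope(speech, fs, (25, 40))  # segment-scale (~30 ms) modulations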


Introduction

A central issue in psycholinguistics, psychoacoustics, speech research, and auditory cognitive neuroscience concerns the range of cues essential for understanding spoken language and how they are extracted by the brain (Greenberg, 2005; Pardo and Remez, 2006; Cutler, 2012). In the domains of psycholinguistics and speech perception, phonetic segments or articulatory features (e.g., Liberman and Mattingly, 1985; Stevens, 2002) and syllables (Dupoux, 1993; Greenberg and Arai, 2004) have been identified as fundamental speech units.

The temporal envelope of speech, which reflects amplitude modulation associated with articulator movement during speech production, has been a focus of intense investigation. These fluctuations in amplitude, at rates between 2 and 50 Hz, are thought to carry information related to phonetic-segment duration and identity, syllabification, and stress (Rosen, 1992; Greenberg, 2005). It is evident from various psychophysical studies under a range of listening conditions that the integrity of the temporal envelope is highly correlated with the ability to understand speech (Houtgast and Steeneken, 1985; Drullman et al., 1994a,b; Chi et al., 1999; Greenberg and Arai, 2004; Obleser et al., 2008; Elliott and Theunissen, 2009; Ghitza, 2012; Peelle et al., 2013; Doelling et al., 2014). A striking demonstration of listeners’ ability to utilize such cues is provided by Shannon et al. (1995): excellent speech comprehension can be achieved by dividing the speech signal into as few as four frequency bands, extracting their temporal envelopes, and using these to modulate Gaussian noise of comparable bandwidth.
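To make the Shannon et al. (1995) manipulation described above concrete, here is a compact noise-vocoder sketch: the speech is split into a small number of frequency bands, each band's temporal envelope is extracted, and that envelope modulates Gaussian noise limited to the same band. The band edges and filter order are illustrative assumptions rather than the original study's parameters.

import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def noise_vocode(x, fs, band_edges=(100, 800, 1500, 2500, 4000)):
    """Four-band noise vocoder in the spirit of Shannon et al. (1995)."""
    rng = np.random.default_rng(0)
    out = np.zeros(len(x))
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        sos = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band", output="sos")
        band = sosfiltfilt(sos, x)                              # analysis band of the speech
        env = np.abs(hilbert(band))                             # that band's temporal envelope
        noise = sosfiltfilt(sos, rng.standard_normal(len(x)))   # noise carrier limited to the same band
        out += env * noise                                      # envelope-modulated noise
    return out / np.max(np.abs(out))                            # peak-normalize the reconstruction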

