Abstract

In speech, listeners extract continuously varying spectrotemporal cues from the acoustic signal to perceive discrete phonetic categories. Spectral cues are spatially encoded in the amplitude of responses in phonetically tuned neural populations in auditory cortex. It remains unknown whether similar neurophysiological mechanisms encode temporal cues like voice-onset time (VOT), which distinguishes sounds like /b/ and /p/. We used direct brain recordings in humans to investigate the neural encoding of temporal speech cues with a VOT continuum from /ba/ to /pa/. We found that distinct neural populations respond preferentially to VOTs from one phonetic category and are also sensitive to sub-phonetic VOT differences within a population's preferred category. In a simple neural network model, simulated populations tuned to detect either temporal gaps or coincidences between spectral cues captured the encoding patterns observed in real neural data. These results demonstrate that a spatial/amplitude neural code underlies the cortical representation of both spectral and temporal speech cues.

Highlights

  • During speech perception, listeners must extract acoustic cues from a continuous sensory signal and map them onto discrete phonetic categories, which are relevant for meaning (Stevens, 2002; Liberman et al., 1967).

  • Temporal cues to voicing category are encoded in spatially distinct neural populations.

  • To investigate neural activity that differentiates the representation of speech sounds based on a temporal cue like voice-onset time (VOT), we recorded high-density electrocorticography in seven participants while they listened to a VOT continuum from /ba/ to /pa/.

  • We found discrete neural populations, located primarily on the bilateral posterior and middle superior temporal gyrus (STG), that respond preferentially either to voiced sounds, where the onset of voicing is coincident with the burst or follows it after a short lag (20 ms or less), or to voiceless sounds, where the onset of voicing follows the burst after a temporal gap of at least 30–50 ms (see the sketch below).
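The sketch below is only a concrete illustration of the temporal cue itself, not code from the study: it computes VOT as the lag from burst release to voicing onset and labels it with the category ranges quoted in the highlight above. The function names are hypothetical, and VOTs falling between the quoted ranges (roughly 20–30 ms) are left unlabeled.

    def vot_ms(burst_onset_ms, voicing_onset_ms):
        """Voice-onset time: lag from the stop burst to the onset of voicing."""
        return voicing_onset_ms - burst_onset_ms

    def voicing_category(vot):
        """Label a VOT (in ms) using the ranges quoted in the highlight above."""
        if vot <= 20:
            return "voiced (/ba/-like): voicing coincides with or shortly follows the burst"
        if vot >= 30:
            return "voiceless (/pa/-like): voicing follows the burst after a temporal gap"
        return "unlabeled: falls between the quoted ranges"

    for vot in (0, 10, 20, 30, 40, 50):
        print(f"{vot:2d} ms -> {voicing_category(vot)}")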

Introduction

Listeners must extract acoustic cues from a continuous sensory signal and map them onto discrete phonetic categories, which are relevant for meaning (Stevens, 2002; Liberman et al., 1967). Distinct neural populations in the superior temporal gyrus (STG) respond selectively to different classes of phonemes that share certain spectral cues, such as the burst associated with stop consonants or the characteristic formant structure of vowels produced with specific vocal tract configurations. It is unclear, however, whether phonetic categories distinguished by temporal cues (e.g., voiced vs. voiceless stops) are represented within an analogous spatial encoding scheme. Here, we show that spatially distinct neural populations respond preferentially to either voiced or voiceless stops, and that peak response amplitude is modulated by stimulus VOT within each population's preferred, but not its non-preferred, voicing category (e.g., a stronger response to 0 ms than to 10 ms VOT in voiced-selective [/b/-selective] neural populations). This same encoding scheme emerged in a computational neural network model simulating neuronal populations as leaky integrators tuned to detect either temporal coincidences or gaps between distinct spectral cues. These findings represent a crucial step toward a unified model of cortical speech encoding, demonstrating that both spectral and temporal cues, and both phonetic and sub-phonetic information, are represented by a common (spatial) neural code.
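The description above is as far as this summary goes, so the following is only a rough sketch of the leaky-integrator idea, not the authors' implementation: a single leaky integrator is driven by two idealized spectral-cue pulses (the burst at 0 ms and voicing onset at VOT ms), and two illustrative readouts approximate a coincidence-tuned and a gap-tuned population. All function names, time constants, and pulse amplitudes are assumptions made for illustration.

    import numpy as np

    def cue_pulses(vot_ms, dur_ms=200, dt_ms=1.0):
        """Two unit-amplitude spectral-cue pulses: burst at 0 ms, voicing onset at vot_ms."""
        drive = np.zeros(int(dur_ms / dt_ms))
        drive[0] += 1.0                    # burst cue
        drive[int(vot_ms / dt_ms)] += 1.0  # voicing-onset cue
        return drive

    def integrate(drive, tau_ms=15.0, dt_ms=1.0):
        """Forward-Euler leaky integration: dv/dt = (-v + drive) / tau."""
        v = np.zeros(len(drive))
        for i in range(1, len(drive)):
            v[i] = v[i - 1] + dt_ms * (-v[i - 1] + drive[i - 1]) / tau_ms
        return v

    def coincidence_response(vot_ms, tau_ms=15.0):
        """Voiced-like readout: peak of the integrated trace, largest when the
        voicing pulse arrives before the burst-driven activity has decayed."""
        return integrate(cue_pulses(vot_ms), tau_ms).max()

    def gap_response(vot_ms, tau_ms=15.0):
        """Voiceless-like readout: the voicing pulse gated by how far the
        burst-driven activity has decayed, i.e. by the length of the gap."""
        return 1.0 - np.exp(-vot_ms / tau_ms)

    for vot in (0, 10, 20, 30, 40, 50):
        print(f"VOT {vot:2d} ms: coincidence {coincidence_response(vot):.3f}, "
              f"gap {gap_response(vot):.3f}")

With these illustrative parameters, the coincidence readout falls off monotonically as VOT increases, while the gap readout is near zero at 0 ms and saturates beyond roughly 30 ms, qualitatively mirroring the voiced- and voiceless-selective response profiles described above.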

Funding: National Institutes of Health
