Speak, memory—Wherefore art thou, invariance?

Steven Greenberg

doi:10.1121/1.3654514

Abstract

Spoken language is highly variable, reflecting factors of environmental (e.g., acoustic-background noise, reverberation), linguistic (e.g., speaking style), and idiosyncratic (e.g., voice quality) origin. Despite such variability, listeners rarely experience difficulty understanding speech. What brain mechanisms underlie this perceptual resilience, and where does the invariance reside (if anywhere) that enables the signal to be reliably decoded and understood? A theoretical framework, DejaNets, is described for how the brain may go from “sound to meaning.” Key to the framework are speech representations in memory, which are crucial for the parsing, analysis, and interpretation of sensory signals. The acoustic waveform is viewed as inherently ambiguous, its interpretation dependent on combining data streams, some sensory (e.g., visual-speech cues), others internal, derived from memory and knowledge schema. This interpretative process is mediated by a hierarchical network of neural oscillators spanning a broad range of time constants (ca. 15–2000 ms), consistent with the time course and temporal structure of spoken language. These oscillators reflect the data-fetching, parsing, and pattern-matching involved in decoding and interpreting the speech signal. DejaNets accounts for many (otherwise) paradoxical and mysterious properties of spoken language, including categorical perception, the McGurk effect, phonemic restoration, semantic context, and robustness/sensitivity to variation in pronunciation, speaking rate, and the ambient acoustic environment. [Work supported by AFOSR.]
