Speech understanding involves the integration and identification of acoustic cues that are distributed over multiple time scales. These range from the sub-millisecond intervals associated with spectral estimates, to the few-millisecond periods of the fundamental frequency (f0), to the tens of milliseconds spanning phonemic and syllabic segments, and to the longer time scales involved in perceiving words and sentences. Much of what is known about the auditory representation of these cues comes from experimental studies in various animal species. Especially well studied are the early stages of the cochlea and cochlear nucleus, and the later cortical stages (Sachs & Young, 1979; Young & Sachs, 1979; Young, 1997, Chap. 4; Clarey, Barone, & Imig, 1992; Calhoun & Schreiner, 1995; Shamma, Versnel, & Kowalski, 1995; Kowalski, Depireux, & Shamma, 1996; deCharms, Blake, & Merzenich, 1998). By contrast, the physiological underpinnings of linguistic processes remain highly elusive despite extensive investigations over the last decade employing a host of new fast human imaging technologies and computational models (Poeppel, 2001; Horwitz, Friston, & Taylor, 2000). These techniques do not yet have the resolution to give clear insight into single units, neural circuits, and their responses and representations. Consequently, the review below concerns conceptions of auditory processes operating at the faster time scales found in the earlier auditory pathway, where animal experimentation is possible. Furthermore, these conceptions are based on extrapolations from experiments that employ stimuli simpler than speech (such as tones and noise with various amplitude and frequency modulations), and hence the models discussed are not specific to speech perception.
Temporal integration in the auditory system refers to the integration of spectro-temporal features over several stages, giving rise to varied forms of spectro-temporal selectivity that have been deemed valuable for speech processing. One example is the selectivity to the speed and direction of frequency-modulated (FM) tones that resemble formant transitions in speech (Nelken &