Physiological data suggest that a two-dimensional signal representation of amplitude-modulation frequency against center frequency is extracted in the central nucleus of the inferior colliculus [C. E. Schreiner and G. Langner, J. Neurophysiol. 60, 1823–1840 (1988)]. The representation groups signals with common harmonics and can easily be adapted to also enhance common onsets, making it an attractive basis for a representational model of auditory scene analysis. Here the map is used as a front end to a neural-network pattern-matching stage for vowel recognition. The model is tested against human performance in a concurrent-vowel recognition task [P. F. Assmann and Q. Summerfield, J. Acoust. Soc. Am. 88, 680–697 (1990)]. It predicts human performance well for an F0-based grouping task, even for vowels with intensity differences of up to 12 dB. Another grouping cue, demonstrated with abstract sounds, is common onset; the double-vowel task was therefore modified to include onset asynchrony. A model based on the change in the AM-map representation between successive frames again predicts human performance well. It is possible to drive the pattern-matching stage directly with the AM-map representation, removing the need for an explicit grouping process, and the model is inherently robust when driven by stimuli that violate grouping cues.
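As a rough illustration of the front end described above, the sketch below computes a two-dimensional AM map (modulation frequency against center frequency) for a mixture of two harmonic complexes on different F0s. It is a minimal sketch, assuming a Butterworth bandpass filterbank and Hilbert-envelope modulation spectra rather than the paper's physiologically based map; the function name `am_map`, the filter parameters, and the test F0s (100 and 126 Hz) are illustrative choices, not taken from the paper.

```python
# Minimal sketch of a 2-D amplitude-modulation map (AM frequency vs.
# center frequency). Assumptions (not from the paper): a 1/3-octave
# Butterworth filterbank and Hilbert envelopes stand in for the
# physiologically based processing described in the abstract.
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def am_map(x, fs, center_freqs, max_mod_hz=500):
    """Return a (channel x modulation-frequency) map for signal x."""
    rows = []
    for fc in center_freqs:
        # Bandpass filter around the center frequency (1/3-octave band).
        lo, hi = fc / 2**(1/6), fc * 2**(1/6)
        sos = butter(4, [lo, hi], btype='bandpass', fs=fs, output='sos')
        band = sosfiltfilt(sos, x)
        # Envelope via the Hilbert transform, with the mean removed.
        env = np.abs(hilbert(band))
        env -= env.mean()
        # Modulation spectrum = magnitude spectrum of the envelope.
        spec = np.abs(np.fft.rfft(env))
        freqs = np.fft.rfftfreq(len(env), 1/fs)
        rows.append(spec[freqs <= max_mod_hz])
    return np.array(rows)

# Example: two concurrent harmonic complexes on different F0s, a crude
# stand-in for a double-vowel stimulus.
fs = 16000
t = np.arange(0, 0.5, 1/fs)
x = sum(np.sin(2*np.pi*100*k*t) for k in range(1, 20))   # F0 = 100 Hz
x += sum(np.sin(2*np.pi*126*k*t) for k in range(1, 16))  # F0 = 126 Hz
centers = np.geomspace(200, 3000, 24)
m = am_map(x, fs, centers)
# Channels dominated by one source show an envelope-modulation peak at
# that source's F0: the common-harmonicity grouping cue the map exposes.
```

In this representation, the onset cue discussed in the abstract corresponds to frame-to-frame changes: computing `am_map` over successive short frames and taking the positive differences between frames highlights channels whose modulation energy has just appeared.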