Abstract

Despite many years of concentrated research, the performance gap between automatic speech recognition (ASR) and human speech recognition (HSR) remains large. The difference between ASR and HSR is particularly evident when considering their response to additive noise. Whereas human performance is remarkably robust, ASR systems are brittle and operate well only within the narrow range of noise conditions for which they were designed. This paper considers how humans may achieve noise robustness. We take the view that robustness is achieved because the human perceptual system treats the problems of speech recognition and sound source separation as tightly coupled. Taking inspiration from Bregman’s Auditory Scene Analysis (ASA) account of auditory organisation, we present a speech recognition system which couples these processes by using a combination of primitive and schema-driven processes: first, a set of coherent spectro-temporal fragments is generated by primitive segmentation techniques; then, a decoder based on statistical ASR techniques performs a simultaneous search for the correct background/foreground segmentation and word sequence hypothesis. Mutually supporting solutions to both the source segmentation and speech recognition problems arise as a result. The decoder is tested on a challenging corpus of connected digit strings mixed monaurally at 0 dB, and recognition performance is compared with that achieved by listeners using identical data. The results, although preliminary, are encouraging and suggest that techniques which interface ASA and statistical ASR have great potential. The paper concludes with a discussion of future research directions that may further develop this class of perceptually motivated ASR solutions.
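To make the two-stage process described above concrete, the following is a minimal, purely illustrative Python sketch, not taken from the paper: the `mixture` spectrogram, the `fragments` list, and the `speech_score` function are hypothetical stand-ins for the primitive segmentation output and the missing-data acoustic model. It shows only the core idea of searching over foreground/background labellings of fragments; an actual speech fragment decoder would fold this search into the Viterbi search over word-sequence hypotheses rather than enumerating labellings exhaustively.

```python
import itertools
import numpy as np

# Toy illustration of the joint search: each "fragment" is a set of
# time-frequency cells produced by primitive grouping. We enumerate every
# foreground/background labelling of the fragments and score the implied
# foreground mask against a toy speech model, keeping the best labelling.

rng = np.random.default_rng(0)

# Toy spectrogram: 4 frequency channels x 6 time frames.
mixture = rng.random((4, 6))

# Toy fragments: lists of (channel, frame) cells (hypothetical values).
fragments = [
    [(0, 0), (0, 1), (1, 0)],
    [(2, 2), (2, 3), (3, 3)],
    [(1, 4), (1, 5), (0, 5)],
]

def speech_score(mask):
    """Toy stand-in for a missing-data acoustic likelihood: reward
    foreground cells whose energy exceeds a fixed threshold."""
    foreground = mixture[mask]
    return float(np.sum(foreground - 0.5))

best_labelling, best_score = None, -np.inf
# 2^N labellings: each fragment is wholly foreground (1) or background (0),
# reflecting the assumption that a coherent fragment belongs to one source.
for labelling in itertools.product([0, 1], repeat=len(fragments)):
    mask = np.zeros_like(mixture, dtype=bool)
    for fragment, label in zip(fragments, labelling):
        if label:
            for channel, frame in fragment:
                mask[channel, frame] = True
    score = speech_score(mask)
    if score > best_score:
        best_labelling, best_score = labelling, score

print("best foreground labelling:", best_labelling, "score:", best_score)
```

Because fragments, not individual time-frequency cells, are the units of the search, the number of segmentation hypotheses grows with the number of fragments rather than with the number of cells, which is what makes the simultaneous segmentation/recognition search tractable.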
