Abstract

This work presents a series of experiments that compare the performance of human speech recognition (HSR) and automatic speech recognition (ASR). The goal of this line of research is to learn from the differences between HSR and ASR and to use this knowledge to incorporate new signal processing strategies from the human auditory system into automatic classifiers. A database of noisy nonsense utterances is used for both HSR and ASR experiments, with a focus on the influence of intrinsic variation (arising from changes in speaking rate, effort, and style). A standard ASR system is found to reach human performance level only when the signal-to-noise ratio is increased by 15 dB, which can be seen as the human–machine gap for speech recognition on a sub-lexical level. The sources of intrinsic variation are found to severely degrade phoneme recognition scores both in HSR and in ASR. A comparison of utterances produced at different speaking rates indicates that temporal cues are not optimally exploited in ASR, which results in a marked increase in vowel confusions. Alternative feature extraction methods that take into account temporal and spectro-temporal modulations of speech signals are discussed.
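The abstract does not specify how the spectro-temporal features are computed; the following is a minimal illustrative sketch, assuming one common realization of such a front end: a 2D Gabor filter applied to a log spectrogram, whose response captures joint temporal and spectral modulations of the kind a standard cepstral front end tends to miss. The filter parameters, kernel size, and test signal are all assumptions for illustration, not the method evaluated in the paper.

```python
import numpy as np
from scipy.signal import stft, convolve2d

def gabor_kernel(omega_t, omega_f, size=(15, 15)):
    """2D complex Gabor kernel tuned to a temporal modulation rate
    omega_t (rad/frame) and spectral modulation scale omega_f
    (rad/channel). Values here are illustrative, not from the paper."""
    t = np.arange(size[1]) - size[1] // 2
    f = np.arange(size[0]) - size[0] // 2
    T, F = np.meshgrid(t, f)
    # Hann envelope localizes the filter in time and frequency.
    envelope = np.hanning(size[0])[:, None] * np.hanning(size[1])[None, :]
    carrier = np.exp(1j * (omega_t * T + omega_f * F))
    return envelope * carrier

# Synthetic test signal: 1 s of a noise-masked tone sweep at 16 kHz.
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * (300 + 200 * t) * t) + 0.3 * np.random.randn(fs)

# Log-magnitude spectrogram (frequency x time frames).
_, _, Z = stft(x, fs=fs, nperseg=400, noverlap=240)
log_spec = np.log(np.abs(Z) + 1e-10)

# Convolve with one Gabor channel; the response magnitude is a
# spectro-temporal modulation feature for this (rate, scale) pair.
g = gabor_kernel(omega_t=0.5, omega_f=0.3)
feature_map = np.abs(convolve2d(log_spec, g, mode="same"))
print(feature_map.shape)
```

A full filterbank would repeat this over a grid of (omega_t, omega_f) pairs, stacking the response maps as input features for the recognizer.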
