Abstract
A major deficiency of state-of-the-art automatic speech recognition systems is their lack of robustness to additive and convolutive noise. The model of auditory perception developed by Dau et al. [J. Acoust. Soc. Am. 99, 3615–3622 (1996)] for psychoacoustical purposes partly overcomes these difficulties when used as a front end for speech recognition. Especially in combination with locally recurrent neural networks (LRNNs), the model output, called the "internal representation," has been shown to provide highly robust feature vectors [Tchorz and Kollmeier, J. Acoust. Soc. Am. (submitted)]. To further improve the performance of this auditory-based LRNN recognition system in background noise, different speech enhancement methods were examined. The minimum mean-square error (MMSE) short-term spectral amplitude (STSA) estimator proposed by Ephraim and Malah [IEEE Trans. Acoust., Speech, Signal Process. 32, 1109–1121 (1984)] was compared to a binaural Wiener filter [Wittkop et al., this meeting] based on directional and coherence cues. Both noise reduction algorithms yield substantially improved recognition rates in nonreverberant noisy conditions, while performance on clean speech is not significantly affected. The algorithms were also evaluated in real-world reverberant conditions with speech-simulating noise and jammer speech.
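The spectral-gain idea behind such enhancement front ends can be sketched as follows. This is a deliberately simplified, illustrative Wiener-style gain, not the Ephraim–Malah MMSE-STSA estimator itself (which relies on a full statistical model of the spectral amplitude) nor the binaural filter of Wittkop et al.; the function names, the power-subtraction SNR estimate, and the spectral floor value are all assumptions for illustration.

```python
# Illustrative sketch: per-bin Wiener-style spectral gain for single-channel
# noise reduction. NOT the Ephraim-Malah MMSE-STSA estimator; just the
# common "attenuate noise-dominated frequency bins" principle it refines.

def wiener_gain(noisy_power, noise_power, floor=0.1):
    """Compute a per-bin gain from noisy-speech and noise power estimates.

    The a-priori SNR is crudely estimated by power subtraction; the gain
    snr / (1 + snr) suppresses bins dominated by noise. A spectral floor
    limits over-suppression (and the resulting "musical noise" artifacts).
    """
    gains = []
    for p_y, p_n in zip(noisy_power, noise_power):
        # Power-subtraction estimate of the a-priori SNR in this bin.
        snr = max(p_y - p_n, 0.0) / max(p_n, 1e-12)
        gains.append(max(snr / (1.0 + snr), floor))
    return gains

def enhance_spectrum(noisy_mag, noisy_power, noise_power):
    """Apply the gain to the noisy magnitude spectrum (noisy phase is kept)."""
    g = wiener_gain(noisy_power, noise_power)
    return [m * gi for m, gi in zip(noisy_mag, g)]
```

In a complete system these gains would be computed frame by frame on a short-time spectrum and the enhanced magnitudes recombined with the noisy phase before resynthesis; the MMSE-STSA estimator replaces the simple gain rule above with an optimal amplitude estimate under Gaussian assumptions.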