Abstract

This paper describes a novel two-stage dereverberation feature enhancement method for noise-robust automatic speech recognition. In the first stage, an estimate of the dereverberated speech is generated by matching the distribution of the observed reverberant speech to that of clean speech, in a decorrelated transformation domain that has a long temporal context in order to address the effects of reverberation. The second stage uses this dereverberated signal as an initial estimate within a non-negative matrix factorization framework, which jointly estimates a sparse representation of the clean speech signal and an estimate of the convolutional distortion. The proposed feature enhancement method, when used in conjunction with automatic speech recognizer back-end processing, is shown to improve the recognition performance compared to three other state-of-the-art techniques.
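
The first-stage idea of matching the distribution of observed reverberant features to that of clean speech can be illustrated with a minimal sketch. It assumes a plain empirical-CDF (quantile) mapping on a single feature dimension. The decorrelating transform and long temporal context mentioned in the abstract are omitted, and the function name `match_distribution` and its arguments are hypothetical, not taken from the paper.

```python
from bisect import bisect_right


def match_distribution(observed, clean_reference):
    """Map observed values onto the clean reference distribution by rank.

    Assumption: simple quantile mapping on one feature dimension; this is
    an illustration, not the paper's exact procedure.
    """
    sorted_obs = sorted(observed)
    sorted_ref = sorted(clean_reference)
    n_obs, n_ref = len(sorted_obs), len(sorted_ref)
    matched = []
    for value in observed:
        # Empirical CDF of the observed data evaluated at this value.
        cdf = bisect_right(sorted_obs, value) / n_obs
        # Inverse empirical CDF of the reference at the same probability.
        index = min(n_ref - 1, max(0, int(round(cdf * n_ref)) - 1))
        matched.append(sorted_ref[index])
    return matched
```

Because the mapping is monotonic, the ordering of the observed values is preserved while their histogram is reshaped to follow the clean reference.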

Highlights

  • Automatic speech recognition (ASR) is becoming an effective and versatile way to interact with modern machine interfaces

  • Previous studies have attempted to counteract the convolutional distortion caused by reverberation using a number of denoising methods, such as frequency domain linear prediction [3], modulation filtered spectrograms [4], or missing-data mask estimation designed for dereverberation [5]

  • While x could be used directly as input for a speech recognition system, in existing work on non-negative matrix factorization (NMF)-based source separation for speech in additive noise [13], better performance was obtained by using the same Wiener-filtering approach we have described for the distribution matching (DM)-based initialization
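
The last highlight contrasts using the NMF estimate directly with filtering the observation. A rough sketch of that kind of Wiener-like ratio filtering, as commonly used in NMF-based source separation, is given below. The exact filter used in the paper is not stated in this excerpt, so the gain definition, the clipping to [0, 1], and all names (`wiener_like_enhance`, `model_of_observed`, `floor`) are assumptions made for illustration.

```python
def wiener_like_enhance(observed, clean_estimate, model_of_observed, floor=1e-10):
    """Apply a ratio gain to the observed magnitudes.

    Assumption: gain = clean estimate / model of the observation, clipped
    to [0, 1]. Hypothetical helper, not the paper's exact filter.
    """
    enhanced = []
    for obs, clean, model in zip(observed, clean_estimate, model_of_observed):
        # Ratio gain; the floor avoids division by zero.
        gain = clean / max(model, floor)
        # Keep the gain in [0, 1] so the filter only attenuates.
        gain = min(1.0, max(0.0, gain))
        enhanced.append(gain * obs)
    return enhanced
```

The design choice here is that the output is the observation scaled by a gain, rather than the model's own reconstruction, so detail in the observed signal that the model does not capture is attenuated rather than replaced.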

Summary

Introduction

Automatic speech recognition (ASR) is becoming an effective and versatile way to interact with modern machine interfaces. For instance, in [2] it was shown that even with state-of-the-art DNN systems, [...] Previous studies have attempted to counteract the convolutional distortion caused by reverberation using a number of denoising methods, such as frequency-domain linear prediction [3], modulation-filtered spectrograms [4], or missing-data mask estimation designed for dereverberation [5]. All of these approaches make weak assumptions about the reverberant data (e.g., they do not require that the room impulse response be known), but they achieve only a moderate increase in ASR performance. In conditions with relatively long reverberation times, REMOS provides higher recognition accuracy than a matched model.

