Abstract

We explore joint training strategies for DNNs that perform simultaneous dereverberation and acoustic modeling to improve the performance of distant speech recognition. There are two key contributions. First, a new DNN structure incorporating both dereverberated and original reverberant features is shown to effectively improve recognition accuracy over the conventional structure that uses only dereverberated features as input. Second, in most simulated reverberant environments for training data collection and DNN-based dereverberation, the resource data and learning targets are high-quality clean speech. With our joint training strategy, we can relax this constraint by using large-scale, diversified real close-talking data as the targets, which are easy to collect via many speech-enabled applications from mobile internet users, and we find this scenario to be even more effective. Our experiments on a Mandarin speech recognition task with 2000 h of training data show that the proposed framework achieves relative word error rate reductions of 9.7% and 8.6% over the multi-condition training systems for the single-channel and multi-channel (with beamforming) cases, respectively. Furthermore, significant gains are consistently observed over the pre-processing approach that simply uses DNN-based dereverberation.
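To make the described structure concrete, below is a minimal PyTorch sketch of a joint network in which a dereverberation front-end produces enhanced features that are concatenated with the original reverberant features before the acoustic model, and both a regression loss and a senone classification loss are optimized jointly. The class name, layer sizes, feature dimension, and loss weighting are illustrative assumptions, not the paper's exact configuration.

```python
# Illustrative sketch only; hyperparameters are assumptions, not the paper's settings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointDereverbAM(nn.Module):
    def __init__(self, feat_dim=40, hidden=1024, num_senones=6000):
        super().__init__()
        # Dereverberation DNN: reverberant features -> enhanced (dereverberated) features
        self.dereverb = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim),
        )
        # Acoustic model sees both enhanced and original reverberant features
        self.acoustic = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_senones),
        )

    def forward(self, reverb_feats):
        enhanced = self.dereverb(reverb_feats)
        logits = self.acoustic(torch.cat([enhanced, reverb_feats], dim=-1))
        return enhanced, logits

def joint_loss(enhanced, logits, target_feats, senone_labels, alpha=0.5):
    # Cross-entropy on senone labels plus an MSE term pulling the enhanced
    # features toward the clean or close-talking targets; alpha is arbitrary here.
    ce = F.cross_entropy(logits, senone_labels)
    mse = F.mse_loss(enhanced, target_feats)
    return ce + alpha * mse
```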

Highlights

  • With the rapid development of the mobile internet, hands-free speech interaction with automatic speech recognition (ASR) systems is natural and becoming increasingly popular

  • The results show that the deep neural network (DNN)-based pDAE (a DAE with phone-class information) slightly outperformed the pLSTM on real test data

  • Since we mainly focus on the effects of reverberation, white noise n(t), scaled by a gain α to obtain different SNRs, is added to the reverberant speech to simulate background noise (see the sketch after this list)

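As a companion to the last highlight, the following NumPy sketch shows one plausible way to generate such data: clean speech x(t) is convolved with a room impulse response h(t), and white noise n(t) scaled by a gain α is added so that a target SNR is reached. The function name and the assumption that the room impulse response is provided externally are illustrative, not taken from the paper.

```python
# Illustrative sketch: reverberant speech = x(t) * h(t) + alpha * n(t),
# with alpha chosen to hit a target SNR. Not the paper's exact recipe.
import numpy as np

def simulate_reverberant(x, rir, target_snr_db, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    reverb = np.convolve(x, rir)[: len(x)]      # x(t) convolved with h(t)
    noise = rng.standard_normal(len(reverb))    # white noise n(t)
    # Solve 10*log10(P_reverb / (alpha^2 * P_noise)) = target_snr_db for alpha
    p_reverb = np.mean(reverb ** 2)
    p_noise = np.mean(noise ** 2)
    alpha = np.sqrt(p_reverb / (p_noise * 10 ** (target_snr_db / 10.0)))
    return reverb + alpha * noise
```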

Summary

Introduction

With the rapid development of the mobile internet, hands-free speech interaction with automatic speech recognition (ASR) systems is natural and becoming increasingly popular. In these application scenarios, the speech signal is often corrupted by reverberation and background noise. Reverberation is the collection of reflected sounds from the surfaces of an enclosure such as an auditorium. It is a desirable property of auditoriums to the extent that it helps to overcome the inverse-square-law drop-off of sound intensity in the enclosure. If excessive, however, it makes sounds run together, with a loss of articulation and muddy, garbled effects. Room reverberation therefore leads to severe degradation of ASR performance.
