Abstract

Distant speech recognition remains challenging, largely because speech signals are corrupted by reverberation when the distance between the speaker and the microphone is large. To cope with the wide range of reverberation encountered in real-world situations, we present novel approaches to acoustic modeling: an ensemble of deep neural networks (DNNs) and an ensemble of jointly trained DNNs. First, multiple DNN acoustic models are designed in a setup step, each handling a different reverberation time (RT60). In addition, each model in the ensemble is jointly trained with a feature-mapping front-end designed for dereverberation, so that feature mapping and acoustic modeling are optimized together. In the testing phase, the outputs of the DNN ensemble are combined by a weighted average of their posterior probabilities, with the weights given by RT60 estimates obtained from a convolutional neural network (CNN). Extensive experiments on the Aurora-4 and CHiME-4 databases demonstrate that the proposed approach yields substantial improvements in speech recognition accuracy over conventional DNN baseline systems under diverse reverberant conditions.
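The combination step described above can be sketched as follows. This is a minimal illustrative example, not the paper's implementation: the function name, array shapes, and the use of a softmax over the CNN's RT60 scores are all assumptions.

```python
import numpy as np

# Hypothetical sketch of the ensemble combination step: posterior
# probabilities from K reverberation-specific DNN acoustic models are
# averaged with weights derived from a CNN's RT60 classification output.

def combine_posteriors(dnn_posteriors, cnn_rt60_scores):
    """Weighted average of per-model senone posteriors.

    dnn_posteriors  : (K, S) array, row k = posterior from the k-th DNN
    cnn_rt60_scores : (K,) array, CNN scores for the K RT60 classes
    returns         : (S,) combined posterior over the S senones
    """
    # Softmax over the CNN's RT60 scores gives the ensemble weights.
    w = np.exp(cnn_rt60_scores - cnn_rt60_scores.max())
    w /= w.sum()
    # Weighted average of the K posterior distributions.
    return w @ dnn_posteriors

# Example: three RT60-specific models over four senones.
post = np.array([[0.70, 0.10, 0.10, 0.10],
                 [0.25, 0.25, 0.25, 0.25],
                 [0.10, 0.10, 0.70, 0.10]])
scores = np.array([2.0, 0.5, 0.0])  # CNN strongly favors the first model
combined = combine_posteriors(post, scores)
```

Because the weights sum to one and each row of `post` is a valid distribution, the combined output is itself a valid posterior distribution, which keeps it directly usable by the downstream decoder.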
