Abstract
We present an information fusion approach to the robust recognition of multi-microphone speech. It is based on a deep learning framework with a large deep neural network (DNN) consisting of subnets designed from different perspectives. Multiple knowledge sources are then reasonably integrated via an early fusion of normalized noisy features with multiple beamforming techniques, enhanced speech features, speaker-related features, and other auxiliary features concatenated as the input to each subnet to compensate for imperfect front-end processing. Furthermore, a late fusion strategy is utilized to leverage the complementary natures of the different subnets by combining the outputs of all subnets to produce a single output set. Testing on the CHiME-3 task of recognizing microphone array speech, we demonstrate in our empirical study that the different information sources complement each other and that both early and late fusions provide significant performance gains, with an overall word error rate of 10.55% when combining 12 systems. Furthermore, by utilizing an improved technique for beamforming and a powerful recurrent neural network (RNN)-based language model for rescoring, a WER of 9.08% can be achieved for the best single DNN system with one-pass decoding among all of the systems submitted to the CHiME-3 challenge.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.