Abstract

Speech enhancement (SE) methods have recently achieved strong performance. However, enhancement can distort the speech signal, so the enhanced speech may lose significant information, which degrades the performance of automatic speech recognition (ASR). To address this problem, this paper proposes a two-stage deep spectrum fusion method with a joint training framework for noise-robust end-to-end (E2E) ASR. It consists of two components: masking and mapping fusion (MMF) and gated recurrent fusion (GRF). The MMF is the first stage and focuses on SE: it exploits the complementarity of masking-based and mapping-based enhancement methods to alleviate speech distortion. The GRF is the second stage and aims to recover the lost information by fusing the MMF-enhanced speech with the original input. We conduct extensive experiments on the open Mandarin speech corpus AISHELL-1 with two noise datasets, 100 Nonspeech and NOISEX-92. Experimental results indicate that the proposed method significantly improves performance, reducing the character error rate (CER) by a relative 17.36% compared with the conventional joint training method.
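
The two fusion ideas described above can be illustrated with a minimal NumPy sketch. This is not the paper's architecture: the function names `masking_mapping_fusion` and `gated_fusion`, the blend weight `alpha`, and the weight matrices `w_e`, `w_n`, and bias `b` are all hypothetical, and a simple per-frame sigmoid gate stands in for the recurrent fusion, whose details the abstract does not give.

```python
import numpy as np


def sigmoid(x):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def masking_mapping_fusion(noisy, mask, mapped, alpha=0.5):
    """Blend a masking-based estimate with a mapping-based estimate.

    noisy, mask, mapped: arrays of shape (frames, bins).
    alpha is a hypothetical fixed blend weight; a real system would
    more likely learn how to combine the two estimates.
    """
    masked = mask * noisy  # masking-based estimate
    return alpha * masked + (1.0 - alpha) * mapped


def gated_fusion(enhanced, noisy, w_e, w_n, b):
    """Fuse enhanced and original spectra with a per-frame sigmoid gate.

    Assumption: this gate is feed-forward, not recurrent. Where the
    gate is near 0, the output falls back to the original input.
    """
    gate = sigmoid(enhanced @ w_e + noisy @ w_n + b)
    return gate * enhanced + (1.0 - gate) * noisy


# Toy data: 4 frames, 6 frequency bins (illustrative values only).
rng = np.random.default_rng(0)
frames, bins = 4, 6
noisy = rng.random((frames, bins))
mask = rng.random((frames, bins))
mapped = rng.random((frames, bins))

enhanced = masking_mapping_fusion(noisy, mask, mapped)
w_e = rng.standard_normal((bins, bins))
w_n = rng.standard_normal((bins, bins))
b = np.zeros(bins)
fused = gated_fusion(enhanced, noisy, w_e, w_n, b)
```

Because the gate lies between 0 and 1, every fused value lies between the corresponding enhanced and original values, which is one simple way to let the second stage reintroduce information that the first stage removed.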
