Abstract

With recent advances in end-to-end text-to-speech (TTS), the quality of synthetic data has improved significantly, and synthesized speech is becoming a feasible alternative to human speech for training speech recognizers. With multi-speaker TTS, the synthesis architecture can combine various speakers, prosodies, and speaking styles, which enriches acoustic diversity and benefits the robustness of automatic speech recognition (ASR). However, the improvement gained by building the acoustic model of an ASR system with synthetic data is still limited by the mismatch between synthetic and real data: human speech is more natural and contains information absent from synthetic data, such as ambient noise and channel-induced frequency warping. In this paper, we propose two novel techniques to mitigate this problem: (i) pre-training a TTS model on a large dataset and then transferring it to each speaker to generate synthetic data better suited to the ASR task, and (ii) a conditional training method that improves the performance of augmenting real data with synthesized material. Experimental results show that these methods can significantly improve speech recognition systems built with synthetic data. For example, on the AISHELL-1 dataset, the proposed methods achieve up to a 41.7% relative reduction in character error rate (CER) compared to the traditional method of building an ASR model with synthetic data. Moreover, we observe up to a 12.7% relative error reduction when augmenting human speech with synthesized speech using our conditional training method, compared to naively using the human speech alone.
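To make the conditional training idea concrete, below is a minimal sketch of one common way to condition an ASR model on the data source: a learned embedding for a binary real-vs-synthetic flag is added to the acoustic features before encoding. The abstract does not specify the architecture, so the class, dimensions, and the additive conditioning scheme here are all illustrative assumptions, not the authors' exact method.

```python
import torch
import torch.nn as nn

class ConditionalASREncoder(nn.Module):
    """Toy ASR encoder conditioned on a real-vs-synthetic flag.

    Hypothetical sketch: the paper's actual model and conditioning
    mechanism may differ.
    """

    def __init__(self, feat_dim=80, hidden_dim=256, num_layers=3):
        super().__init__()
        # One learned embedding per data source: 0 = real, 1 = synthetic.
        self.source_embed = nn.Embedding(2, feat_dim)
        self.rnn = nn.LSTM(feat_dim, hidden_dim, num_layers, batch_first=True)

    def forward(self, feats, is_synthetic):
        # feats: (batch, time, feat_dim); is_synthetic: (batch,) long tensor.
        cond = self.source_embed(is_synthetic).unsqueeze(1)  # (batch, 1, feat_dim)
        # Broadcast the condition vector over the time axis.
        out, _ = self.rnn(feats + cond)
        return out

# Usage: a mixed mini-batch of real and synthetic utterances.
feats = torch.randn(4, 100, 80)          # e.g. 80-dim filterbank features
flags = torch.tensor([0, 0, 1, 1])       # first two real, last two synthetic
enc = ConditionalASREncoder()
hidden = enc(feats, flags)               # (4, 100, 256)
```

At test time only real speech is seen, so decoding would presumably use the "real" flag throughout; this lets the model exploit synthetic data during training without forcing it to treat the two sources as identically distributed.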
