Abstract

Style modeling is an important problem in expressive speech synthesis. In existing unsupervised methods, a style encoder extracts a latent representation from reference audio as style information. However, this representation also entangles some content information, which conflicts with the actual input text and degrades the synthesized speech. In this study, we propose to alleviate the entanglement problem by jointly training a Text-To-Speech (TTS) model and an Automatic Speech Recognition (ASR) model with a shared-layer network, and by using ASR adversarial training to remove content information from the style representation. We further propose an adaptive adversarial weight learning strategy to prevent the model from collapsing. Objective evaluation using word error rate (WER) demonstrates that our method effectively alleviates the entanglement between style and content information. Subjective evaluation indicates that the method improves the quality of synthesized speech and enhances style transfer ability compared with the baseline models.
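The abstract does not specify the adaptive adversarial weight rule. A common choice in adversarial feature learning (e.g., DANN-style gradient reversal) is to ramp the adversarial loss weight from zero toward a cap as training progresses, so that the reversed ASR gradients cannot overwhelm the TTS loss early in training and collapse the model. A minimal sketch, assuming a sigmoid ramp schedule (a hypothetical illustration, not the paper's exact strategy):

```python
import math

def adversarial_weight(step, total_steps, lam_max=1.0, gamma=10.0):
    """Sigmoid ramp for the adversarial loss weight (DANN-style; assumed).

    Returns 0 at step 0 and approaches lam_max as training progresses.
    Keeping the adversarial term small early on lets the TTS loss
    stabilize before the ASR adversary starts stripping content
    information out of the style representation.
    """
    p = step / total_steps  # training progress in [0, 1]
    return lam_max * (2.0 / (1.0 + math.exp(-gamma * p)) - 1.0)

# Schematic combined objective per step:
#   L_total = L_tts - adversarial_weight(step, total_steps) * L_asr_adv
# where the minus sign reflects the gradient reversal on the ASR branch.
```

In this sketch the weight starts at exactly zero and saturates near `lam_max`; `gamma` controls how quickly the ramp rises.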
