Abstract

In this paper, we propose a novel method for fast and efficient few-shot TTS that disentangles linguistic and speaker representations. Specifically, an adversarial training strategy is first employed to remove speaker information from the linguistic representations. The speaker representations are then extracted from audio signals by a speaker encoder equipped with a random sampling mechanism and a speaker classifier, which together yield speaker embeddings that are independent of content information (such as prosody and style). Meanwhile, for faster and more efficient adaptation, we introduce prior alignment knowledge between text and audio pairs and propose a multi-alignment guided attention to aid attention learning. Experimental results show that, when adapting to new speakers with 20 utterances, the proposed method not only generates speech of higher quality and speaker similarity, with average absolute MOS improvements of 0.26 and 0.30 respectively, but also converges much faster. Moreover, we achieve a MOS of 4.45 for a premium voice, outperforming the 4.23 of a single-speaker model.
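As a rough illustration of two ingredients mentioned above (not the authors' implementation), the sketch below shows a gradient-reversal layer, one common way to realize adversarial removal of speaker information from encoder outputs, together with the standard single-alignment guided-attention loss of Tachibana et al. (2018), which multi-alignment variants build on. The function names, the width parameter g, and the choice of gradient reversal as the adversarial mechanism are assumptions for illustration only.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Gradient reversal: identity in the forward pass; the backward pass
    negates (and scales) the gradient, so a speaker classifier trained on
    these features pushes the encoder to discard speaker information."""
    @staticmethod
    def forward(ctx, x, lambd=1.0):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def guided_attention_loss(attn, g=0.2):
    """Penalize attention mass far from the diagonal.
    attn: (batch, N_text, T_mel) attention weights from the decoder."""
    B, N, T = attn.shape
    n = torch.arange(N, device=attn.device).float().unsqueeze(1) / N  # (N, 1)
    t = torch.arange(T, device=attn.device).float().unsqueeze(0) / T  # (1, T)
    W = 1.0 - torch.exp(-((n - t) ** 2) / (2.0 * g * g))              # (N, T)
    return (attn * W.unsqueeze(0)).mean()
```

In use, the linguistic features would pass through `GradReverse.apply(features)` before the speaker classifier, so minimizing the classifier's loss simultaneously maximizes it with respect to the encoder; the guided-attention term is simply added to the reconstruction loss during training.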
