Abstract

This paper proposes a speech synthesis method based on fine-grained style representations learned without supervision, named word-level style variations (WSVs), in order to improve the naturalness of synthetic speech. The proposed model consists of a WSV extractor and a WSV predictor. The WSV extractor is jointly trained with a sequence-to-sequence (Seq2seq) synthesizer and learns a WSV vector from the mel-spectrogram of each prosodic word in the training set by extending the global style token (GST) framework. In contrast to GST weights, which describe the global style of an utterance, WSVs operate at the word level and are expected to describe local style properties, such as stress. Furthermore, Gumbel softmax is adopted so that the extracted WSVs are close to one-hot vectors, which facilitates the subsequent prediction task. The WSV predictor is a deterministic model that generates the sequence of WSV vectors from input text using an autoregressive LSTM network. In addition to phonetic information, e.g., phoneme sequences, a Bidirectional Encoder Representations from Transformers (BERT) model is employed by the predictor to obtain semantic descriptions of the input text for better prediction of the latent speech representations, i.e., WSVs. The WSV predictor is trained by considering both the accuracy of WSV prediction and the distortion of mel-spectrograms recovered from the predicted WSVs. Experimental results show that the proposed method achieves better naturalness of synthetic speech than the baseline Tacotron2, text-predicted global style token (TP-GST), and BERT-Tacotron2 models.
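
To make the token-selection step concrete, below is a minimal, hypothetical sketch of how a GST-style extractor could produce near-one-hot WSVs via Gumbel softmax. All names (WSVExtractor, ref_dim, num_tokens, token_dim) and dimensions are illustrative assumptions, not the paper's implementation; only the GST token bank and the Gumbel-softmax trick are taken from the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WSVExtractor(nn.Module):
    """Sketch of a word-level style extractor extending the GST framework."""

    def __init__(self, ref_dim=128, num_tokens=10, token_dim=256):
        super().__init__()
        # Bank of learnable style tokens, as in the GST framework.
        self.tokens = nn.Parameter(torch.randn(num_tokens, token_dim))
        # Projects a per-word reference embedding to logits over the tokens.
        self.attn = nn.Linear(ref_dim, num_tokens)

    def forward(self, word_ref_embedding, tau=0.5):
        # word_ref_embedding: (batch, ref_dim), assumed to be pooled from the
        # mel-spectrogram frames of one prosodic word by a reference encoder.
        logits = self.attn(word_ref_embedding)
        # Gumbel softmax yields near-one-hot weights; hard=True applies the
        # straight-through estimator, which eases the later prediction task.
        wsv = F.gumbel_softmax(logits, tau=tau, hard=True)
        # The style embedding passed to the Seq2seq synthesizer is the
        # WSV-weighted sum over the token bank.
        style = wsv @ self.tokens
        return wsv, style
```

Because hard=True makes each WSV an (almost) one-hot vector while remaining differentiable through the straight-through estimator, the downstream autoregressive predictor can treat WSV generation as a near-categorical choice per prosodic word rather than regression in a continuous latent space.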
