Deep Gaussian process based multi-speaker speech synthesis with latent speaker representation

Kentaro Mitsui,Tomoki Koriyama,Hiroshi Saruwatari

doi:10.1016/j.specom.2021.07.001

Kentaro Mitsui, Tomoki Koriyama + Show 1 more

Open Access

https://doi.org/10.1016/j.specom.2021.07.001

Copy DOI

Journal: Speech Communication	Publication Date: Jul 10, 2021
Citations: 6	License type: cc-by

Affiliation: The University of Tokyo

Abstract

This paper proposes deep Gaussian process (DGP)-based frameworks for multi-speaker speech synthesis and speaker representation learning. A DGP has a deep architecture of Bayesian kernel regression, and it has been reported that DGP-based single speaker speech synthesis outperforms deep neural network (DNN)-based ones in the framework of statistical parametric speech synthesis. By extending this method to multiple speakers, it is expected that higher speech quality can be achieved with a smaller number of training utterances from each speaker. To apply DGPs to multi-speaker speech synthesis, we propose two methods: one using DGP with one-hot speaker codes, and the other using a deep Gaussian process latent variable model (DGPLVM). The DGP with one-hot speaker codes uses additional GP layers to transform speaker codes into latent speaker representations. The DGPLVM directly models the distribution of latent speaker representations and learns it jointly with acoustic model parameters. In this method, acoustic speaker similarity is expressed in terms of the similarity of the speaker representations, and thus, the voices of similar speakers are efficiently modeled. We experimentally evaluated the performance of the proposed methods in comparison with those of conventional DNN and variational autoencoder (VAE)-based frameworks, in terms of acoustic feature distortion and subjective speech quality. The experimental results demonstrate that (1) the proposed DGP-based and DGPLVM-based methods improve subjective speech quality compared with a feed-forward DNN-based method, and (2) even when the amount of training data for target speakers is limited, the DGPLVM-based method outperforms other methods, including the VAE-based one. Additionally, (3) by using a speaker representation randomly sampled from the learned speaker space, the DGPLVM-based method can generate voices of non-existent speakers.

Full Text