Autoregressive Variational Autoencoder with a Hidden Semi-Markov Model-Based Structured Attention for Speech Synthesis

Takato Fujimoto, Kei Hashimoto, Keiichi Tokuda, Yoshihiko Nankaku

doi:10.1109/icassp43922.2022.9746158

Abstract

This paper proposes an autoregressive speech synthesis model based on the variational autoencoder that incorporates a latent sequence representation for acoustic and linguistic features and the structure of a hidden semi-Markov model (HSMM). Although autoregressive models can model acoustic features efficiently and accurately, they suffer from exposure bias, i.e., the mismatch between training (teacher forcing) and inference (free running). To overcome this problem, we introduce an autoregressive latent variable sequence instead of generating the observations autoregressively. A latent representation of the alignment, based on an HSMM-based structured attention mechanism, enables the use of a completely consistent training algorithm for acoustic modeling with explicit duration models. Experimental results indicate that the proposed model outperformed the baselines in subjective naturalness.

Full Text