Speaker Adaptation for Speech Synthesis Based on Deep Neural Networks Using Hidden Semi-Markov Model Structures

Kento Nakao, Keiichi Tokuda, Kei Hashimoto, Yoshihiko Nankaku, Keiichiro Oura

doi:10.23919/apsipa.2018.8659791

Abstract

This paper proposes a speaker adaptation technique for speech synthesis based on deep neural networks (DNNs) using hidden semi-Markov model (HSMM) structures. Conventional speaker adaptation techniques for DNN-based speech synthesis rely on fixed time alignments estimated by external aligners, so the acoustic features and the temporal structure of speech are adapted separately. In this work, a special type of mixture density network (MDN) called MDN-HSMM, which outputs the parameters of HSMMs, is applied. The proposed method models both acoustic features and durations in a unified framework and performs speaker adaptation that takes temporal structure into account. Experimental results show that the proposed method improves the naturalness and speaker similarity of the synthesized speech compared with conventional DNN-based speaker adaptation.
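As a rough illustration of what it means for a network to output HSMM parameters and to score acoustics and durations jointly, the following toy sketch may help. It is not the authors' implementation: the class `ToyMdnHsmmHead`, the function `joint_log_likelihood`, the single linear layer, the one-dimensional Gaussian output and duration distributions, and the softplus offset are all assumptions made for illustration. It also scores only one given segmentation, whereas an HSMM-based model would normally consider all possible state alignments.

```python
import math
import random


def gaussian_logpdf(x, mean, log_var):
    """Log density of a 1-D Gaussian given its mean and log-variance."""
    return -0.5 * (math.log(2.0 * math.pi) + log_var
                   + (x - mean) ** 2 / math.exp(log_var))


class ToyMdnHsmmHead:
    """Hypothetical linear layer mapping a linguistic vector to HSMM state parameters."""

    def __init__(self, input_dim, num_states, seed=0):
        rng = random.Random(seed)
        self.num_states = num_states
        # Four outputs per state (an assumption of this sketch):
        # acoustic mean, acoustic log-variance, raw duration mean, duration log-variance.
        self.weights = [[rng.uniform(-0.1, 0.1) for _ in range(input_dim)]
                        for _ in range(4 * num_states)]
        self.bias = [0.0] * (4 * num_states)

    def forward(self, linguistic):
        raw = [sum(w * x for w, x in zip(row, linguistic)) + b
               for row, b in zip(self.weights, self.bias)]
        states = []
        for k in range(self.num_states):
            a_mean, a_log_var, d_raw, d_log_var = raw[4 * k:4 * k + 4]
            states.append({
                "acoustic_mean": a_mean,
                "acoustic_log_var": a_log_var,
                # Softplus plus an offset of 1 keeps the mean duration above one frame.
                "duration_mean": 1.0 + math.log1p(math.exp(d_raw)),
                "duration_log_var": d_log_var,
            })
        return states


def joint_log_likelihood(states, frames, durations):
    """Sum duration and acoustic log-densities for one given segmentation."""
    if len(durations) != len(states) or sum(durations) != len(frames):
        raise ValueError("durations must cover every state and every frame")
    total, t = 0.0, 0
    for state, d in zip(states, durations):
        # Duration term: how plausible is staying d frames in this state?
        total += gaussian_logpdf(d, state["duration_mean"], state["duration_log_var"])
        # Acoustic term: how plausible are the frames assigned to this state?
        for frame in frames[t:t + d]:
            total += gaussian_logpdf(frame, state["acoustic_mean"],
                                     state["acoustic_log_var"])
        t += d
    return total


if __name__ == "__main__":
    head = ToyMdnHsmmHead(input_dim=3, num_states=2)
    states = head.forward([1.0, 0.0, 0.5])
    print(joint_log_likelihood(states, [0.1, 0.0, -0.2, 0.3, 0.1], [2, 3]))
```

Because both terms depend on the same network outputs, any adaptation of the network weights in this sketch changes the duration and acoustic scores together, which is the intuition behind adapting temporal structure and acoustic features in one framework.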

Full Text