Abstract

This paper proposes a speaker adaptation technique for deep neural network (DNN)-based speech synthesis that uses hidden semi-Markov model (HSMM) structures. Existing speaker adaptation techniques for DNN-based speech synthesis rely on fixed time alignments estimated by external aligners, so the acoustic features and the temporal structure of speech are adapted separately. This work applies a special type of mixture density network (MDN), called MDN-HSMM, that outputs the parameters of HSMMs. The proposed method models both acoustic features and durations in a unified framework, which allows speaker adaptation to take temporal structure into account. Experimental results show that the proposed method improves the naturalness and speaker similarity of synthesized speech compared with conventional DNN-based speaker adaptation.

