Generative modeling of speech F0 contours

Hirokazu Kameoka, Shigeki Sagayama, Kota Yoshizato, Yasunori Ohishi, Tatsuma Ishihara, Kunio Kashino

doi:10.21437/interspeech.2013-450

Abstract

This paper introduces our ongoing work on generative modeling of speech fundamental frequency (F0) contours for estimating prosodic features from raw speech data. The present F0 contour model is formulated by translating the Fujisaki model, a well-founded mathematical model representing the control mechanism of vocal fold vibration, into a probabilistic model described as a discrete-time stochastic process. The motivation behind this formulation is twofold. One aim is to derive a general parameter estimation framework for the Fujisaki model, allowing for the introduction of powerful statistical methods. The other is to construct an automatically trainable version of the Fujisaki model so that, in the future, it can be used to develop a statistical speaking-style conversion system or be incorporated into existing text-to-speech synthesis systems to improve the naturalness and intelligibility of computer-generated speech. We also briefly introduce a generative model of the F0 contours of singing voice, developed in the same spirit.

Index Terms: speech F0 contour, Fujisaki model, generative model, hidden Markov model, EM algorithm

Full Text