Abstract

To improve the performance of an HMM-based voice conversion system that uses LSP coefficients as the spectral representation, this paper adopts a model clustering technique that ties HMMs into classes for model adaptation while taking their phonetic and linguistic contextual factors into account. In addition, because the LSP coefficients of adjacent orders are correlated, an appropriate format for the regression matrix is suggested for cases where only a small amount of adaptation training data is available. Subjective and objective tests show that the source HMMs can be adapted more accurately with the proposed method, and that the synthetic speech generated from the adapted model has better discrimination and speech quality.

Index Terms: model adaptation, regression matrix clustering, regression matrix format

1. Introduction

With the development of corpus-based speech synthesis techniques, the intelligibility and naturalness of synthetic speech have improved considerably. However, it remains difficult for a corpus-based TTS system to synthesize speech of various speakers and speaking styles from a limited database. Voice conversion, which converts one speaker's voice into another speaker's voice, therefore offers a promising approach to multi-speaker speech synthesis. The HMM-based voice conversion system is built on HMM-based speech synthesis. In an HMM-based speech synthesis system, spectrum, pitch and duration are modeled simultaneously in a unified HMM framework [1][2][3]. In addition, the voice characteristics of the synthetic speech can be converted from one speaker to another by applying a model adaptation algorithm, such as the MLLR (maximum likelihood linear regression) algorithm [4][5], with a small amount of speech uttered by the target speaker.
We have built an HMM-based speech synthesis system that uses LSP (line spectral pair) coefficients and the STRAIGHT (Speech Transformation and Representation using Adaptive Interpolation of weighted spectral contour) analysis-synthesis algorithm [6][7]. By implementing the MLLR algorithm, we give the synthesis system the ability to synthesize the voices of various speakers. However, two main problems remain in the HMM-based voice conversion system. First, the data-driven clustering method described in the MLLR algorithm ignores many contextual factors between HMMs, so some unrelated HMMs are forced into one class, which reduces the accuracy of the model adaptation. Second, the system performance, including the voice characteristics and voice quality of the synthetic speech, degrades greatly when the adaptation training data is very limited.

To address these problems, this paper describes a clustering method that uses the context decision tree to take the phonetic and linguistic connections between HMMs into account; similar approaches have been applied in both HMM-based speech recognition and HMM-based speech synthesis [8][9]. Moreover, because only the LSP coefficients of a few adjacent orders are strongly correlated, an appropriate regression matrix format is suggested for cases where very little training data is available.

The rest of this paper is organized as follows. Section 2 gives an overview of our HMM-based voice conversion system. Section 3 describes the details of the proposed context clustering decision tree and the appropriate regression matrix for model adaptation. Section 4 presents the results of experiments, including subjective and objective evaluations, and Section 5 concludes.
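The idea of restricting the regression matrix to adjacent LSP orders can be illustrated with a small sketch. It assumes the usual MLLR mean transform, in which an adapted mean is obtained as A·mu + b, and it assumes a banded shape for A; the function names, the `half_width` parameter and the 24th-order example are hypothetical choices for illustration and are not taken from the paper, which specifies its actual matrix format in Section 3.

```python
def band_mask(order, half_width):
    """Return an order x order 0/1 mask keeping entries within half_width of the diagonal."""
    return [[1 if abs(i - j) <= half_width else 0 for j in range(order)]
            for i in range(order)]


def count_free_parameters(order, half_width):
    """Count the free entries of the banded matrix A plus the bias vector b."""
    mask = band_mask(order, half_width)
    return sum(sum(row) for row in mask) + order


def adapt_mean(mean, matrix, bias, half_width):
    """Apply mu_hat = A mu + b, using only the banded part of A (assumed format)."""
    order = len(mean)
    mask = band_mask(order, half_width)
    return [bias[i] + sum(mask[i][j] * matrix[i][j] * mean[j] for j in range(order))
            for i in range(order)]


if __name__ == "__main__":
    order = 24  # hypothetical LSP order, chosen only for this example
    print("tridiagonal:", count_free_parameters(order, 1))
    print("full matrix:", count_free_parameters(order, order - 1))
```

Under these assumptions, a narrower band leaves far fewer parameters to estimate per regression class, which is the reason a restricted format is attractive when adaptation data is scarce.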

