The basic goal of the voice conversion system is to modify the speaker-specific characteristics, keeping the message and the environmental information contained in the speech signal intact. Speaker characteristics reflect in speech at different levels, such as, the shape of the glottal pulse (excitation source characteristics), the shape of the vocal tract (vocal tract system characteristics) and the long-term features (suprasegmental or prosodic characteristics). In this paper, we are proposing neural network models for developing mapping functions at each level. The features used for developing the mapping functions are extracted using pitch synchronous analysis. Pitch synchronous analysis provides the estimation of accurate vocal tract parameters, by analyzing the speech signal independently in each pitch period without influenced by the adjacent pitch cycles. In this work, the instants of significant excitation are used as pitch markers to perform the pitch synchronous analysis. The instants of significant excitation correspond to the instants of glottal closure (epochs) in the case of voiced speech, and to some random excitations like onset of burst in the case of nonvoiced speech. Instants of significant excitation are computed from the linear prediction (LP) residual of speech signals by using the property of average group-delay of minimum phase signals. In this paper, line spectral frequencies (LSFs) are used for representing the vocal tract characteristics, and for developing its associated mapping function. LP residual of the speech signal is viewed as excitation source, and the residual samples around the instant of glottal closure are used for mapping. Prosodic parameters at syllable and phrase levels are used for deriving the mapping function. Source and system level mapping functions are derived pitch synchronously, and the incorporation of target prosodic parameters is performed pitch synchronously using instants of significant excitation. The performance of the voice conversion system is evaluated using listening tests. The prediction accuracy of the mapping functions (neural network models) used at different levels in the proposed voice conversion system is further evaluated using objective measures such as deviation ( D i ) , root mean square error ( μ RMSE ) and correlation coefficient ( γ X , Y ). The proposed approach (i.e., mapping and modification of parameters using pitch synchronous approach) used for voice conversion is shown to be performed better compared to the earlier method (mapping the vocal tract parameters using block processing) proposed by the author.
Read full abstract