Abstract

In this paper, we propose a frame selection approach to voice conversion based on a speaker-independent deep neural network (SI-DNN) and the Kullback-Leibler divergence (KLD). The acoustic difference between the source and target speakers is equalized by the SI-DNN in the ASR senone phonetic space. KLD is used as an ideal distortion measure to select the corresponding target frame for each source frame. The acoustic trajectory of the selected frames is rendered with the maximum probability trajectory generation algorithm, and a WaveNet-based vocoder is applied to the converted acoustic trajectory to obtain the final speech waveform. From the subjective results, we find that 1) the proposed method achieves better performance than the phonetic cluster based selection method [16]; 2) applying the WaveNet vocoder significantly improves naturalness and speaker similarity compared with a linear predictive coding (LPC) based vocoder; and 3) a WaveNet vocoder trained only with spectral features, i.e., line spectrum pairs (LSP), better maintains the pitch pattern of the target speaker than a WaveNet vocoder trained with both spectral features (LSP) and prosodic features (F0 and an unvoiced/voiced flag).
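
To make the frame-selection step concrete, below is a minimal sketch of KLD-based target-frame selection over SI-DNN senone posteriors. The function names (`kld`, `select_target_frames`), the exhaustive search over all target frames, and the use of the asymmetric divergence D(p || q) are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def kld(p, q, eps=1e-10):
    """Kullback-Leibler divergence D(p || q) between two senone posterior vectors."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def select_target_frames(source_posteriors, target_posteriors):
    """For each source frame, pick the target frame whose SI-DNN senone
    posterior has the smallest KLD to the source frame's posterior.

    source_posteriors: (T_src, N_senones) array of source-frame posteriors
    target_posteriors: (T_tgt, N_senones) array of target-frame posteriors
    Returns a list with the selected target-frame index for each source frame.
    """
    selected = []
    for p in source_posteriors:
        divergences = [kld(p, q) for q in target_posteriors]
        selected.append(int(np.argmin(divergences)))
    return selected
```

In practice the selected target frames would then be smoothed by the trajectory generation step rather than concatenated directly, since frame-by-frame selection alone yields a discontinuous acoustic trajectory.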
