Abstract

Speech-rate conversion, which expands or compresses a speech waveform while preserving its pitch, has traditionally been realized by signal-processing-based approaches. To improve synthesis quality, this paper proposes a machine-learning-based approach that uses neural vocoders to perform neural speech-rate conversion. The proposed approach introduces a multispeaker WaveNet vocoder trained on a multispeaker corpus. Speech-rate conversion for arbitrary speakers not included in the training data is realized by resampling acoustic features or hidden features along the time axis at inference time. In the experiments, the multispeaker WaveNet vocoder was trained on the JVS corpus, and two types of resampling methods were compared. The conventional WSOLA and STRAIGHT methods were also compared as signal-processing-based baselines. The test sets included Japanese speaker corpora for the monolingual condition and an English multispeaker corpus (CMU ARCTIC) for the cross-lingual condition. The experimental results demonstrate that the proposed approach with resampling of hidden features achieves higher-quality speech-rate conversion than the conventional methods in both the monolingual and cross-lingual conditions, except for speakers with low fundamental frequency when converting to fast speech.
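To illustrate the core operation described above, the following is a minimal sketch (not the paper's implementation) of resampling a frame-level feature sequence along the time axis, assuming linear interpolation over feature frames; the function name, interpolation scheme, and rate parameter are illustrative assumptions, and in the proposed approach the same idea is applied to either acoustic features or the vocoder's hidden features before waveform generation.

```python
# Hypothetical sketch: time-axis resampling of a (frames x dims) feature sequence.
# Linear interpolation is an assumption; the paper compares resampling methods.
import numpy as np


def resample_time_axis(features: np.ndarray, rate: float) -> np.ndarray:
    """Stretch (rate < 1) or compress (rate > 1) features along the time axis.

    features: array of shape (T, D), one D-dimensional vector per frame.
    rate:     speech-rate factor, e.g. 0.5 for half speed, 2.0 for double speed.
    """
    num_frames, dim = features.shape
    new_len = max(1, int(round(num_frames / rate)))
    # Fractional positions in the original frame-index space for each output frame.
    src_pos = np.linspace(0, num_frames - 1, new_len)
    # Interpolate each feature dimension independently.
    out = np.empty((new_len, dim), dtype=features.dtype)
    for d in range(dim):
        out[:, d] = np.interp(src_pos, np.arange(num_frames), features[:, d])
    return out


if __name__ == "__main__":
    feats = np.random.randn(200, 80).astype(np.float32)  # e.g. 200 frames of 80-dim features
    slow = resample_time_axis(feats, rate=0.5)  # longer sequence -> slower speech
    fast = resample_time_axis(feats, rate=2.0)  # shorter sequence -> faster speech
    print(slow.shape, fast.shape)               # (400, 80) (100, 80)
```

Feeding the resampled sequence to the (multispeaker WaveNet) vocoder then yields a waveform with a changed duration but unchanged pitch, since the frame-level features themselves are not modified.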
