A BLSTM and WaveNet-Based Voice Conversion Method With Waveform Collapse Suppression by Post-Processing

Xiaokong Miao, Meng Sun, Tieyong Cao, Changyan Zheng, Xiongwei Zhang

doi:10.1109/access.2019.2912926

Open Access

https://doi.org/10.1109/access.2019.2912926

Journal: IEEE Access	Publication Date: Jan 1, 2019
Citations: 23	License type: cc-by-nc-nd

Affiliation: PLA Army Engineering University

Abstract

In recent years, neural network-based voice conversion methods have developed rapidly, and many different models and neural networks have been applied to parallel voice conversion. However, the over-smoothing of parametric methods [e.g., bidirectional long short-term memory (BLSTM)] and the waveform collapse of neural vocoders (e.g., WaveNet) still degrade the quality of the converted voices. To overcome these problems, we propose a BLSTM and WaveNet-based voice conversion method combined with waveform collapse suppression by post-processing. The method first uses a BLSTM to convert the acoustic features between parallel speakers and then synthesizes a pre-converted voice with WaveNet. Subsequently, several alternating iterations of BLSTM post-processing are performed, and the final converted voice is generated by WaveNet. The proposed method can directly generate converted audio waveforms and effectively avoids the waveform-collapsed speech caused by a single WaveNet generation. The experimental results indicate that acoustic features obtained with the BLSTM network can achieve better results than conventional baselines. In our experiments on VCC2018, the use of WaveNet can alleviate the over-smoothing problem, which contributes to improving the similarity and naturalness of the final converted voices.
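
The order of the stages described in the abstract can be sketched as a small driver function. This is only an illustrative outline, not the authors' implementation: the names `convert_voice`, `convert`, `vocode`, `analyze`, `post_process`, and `n_iterations` are hypothetical, the trained BLSTM and WaveNet models are replaced by toy arithmetic stand-ins, and the step that re-extracts features from the pre-converted voice before post-processing is an assumption that the abstract does not state.

```python
from typing import Callable, List

Features = List[List[float]]  # frames x feature dimensions (assumed layout)
Waveform = List[float]


def convert_voice(
    source_features: Features,
    convert: Callable[[Features], Features],       # stand-in for the conversion BLSTM
    vocode: Callable[[Features], Waveform],        # stand-in for the WaveNet vocoder
    analyze: Callable[[Waveform], Features],       # assumed feature re-extraction step
    post_process: Callable[[Features], Features],  # stand-in for the post-processing BLSTM
    n_iterations: int = 2,                         # hypothetical iteration count
) -> Waveform:
    """Run the stages in the order the abstract lists them."""
    converted = convert(source_features)   # 1. BLSTM feature conversion
    pre_converted = vocode(converted)      # 2. WaveNet pre-converted voice
    features = analyze(pre_converted)      # assumption: features come from that voice
    for _ in range(n_iterations):          # 3. repeated BLSTM post-processing
        features = post_process(features)
    return vocode(features)                # 4. final WaveNet generation


# Toy stand-ins so the sketch runs without any trained model.
def toy_convert(feats: Features) -> Features:
    return [[x + 1.0 for x in frame] for frame in feats]


def toy_vocode(feats: Features) -> Waveform:
    return [x for frame in feats for x in frame]


def toy_analyze(wave: Waveform, dim: int = 2) -> Features:
    return [wave[i:i + dim] for i in range(0, len(wave), dim)]


def toy_post_process(feats: Features) -> Features:
    return [[0.5 * x for x in frame] for frame in feats]


result = convert_voice(
    [[0.0, 1.0], [2.0, 3.0]],
    toy_convert, toy_vocode, toy_analyze, toy_post_process,
    n_iterations=2,
)
```

Passing the stages in as callables keeps the control flow separate from the models, so real networks could be substituted for the toy functions without changing the driver.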
