Abstract

WaveNet is introduced for waveform generation. It produces high-quality text-to-speech synthesis, music generation, and voice conversion. However, it generally requires a large amount of training data, which limits its scope of application, e.g. in voice conversion. In this paper, we propose a factorized WaveNet for limited-data tasks. Specifically, we apply singular value decomposition (SVD) on the dilated convolution layers of WaveNet to reduce the number of parameters. By doing so, we reduce the data requirement for WaveNet training while maintaining similar network performance. We use voice conversion as a case study to validate the proposed idea. Two sets of experiments are conducted, where WaveNet is used as a vocoder and as an integrated converter–vocoder respectively. Experiments on the CMU-ARCTIC and CSTR-VCTK corpora show that the factorized WaveNet consistently outperforms its original WaveNet counterpart when trained on the same amount of data. We also apply SVD similarly to the real-time neural vocoder Parallel WaveGAN for voice conversion, and observe a similar improvement.
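The core idea of the parameter reduction can be sketched as a truncated SVD of a convolution weight: the flattened weight matrix is approximated by the product of two much smaller matrices. The sketch below is a minimal illustration, not the authors' implementation; the layer dimensions (256 channels, kernel size 2) and the rank are assumed for the example.

```python
import numpy as np

def factorize_conv_weight(W, rank):
    """Low-rank factorization of a (flattened) dilated-convolution weight
    via truncated SVD.

    W: array of shape (out_channels, in_channels * kernel_size).
    Returns (A, B) with A of shape (out_channels, rank) and
    B of shape (rank, in_channels * kernel_size), so that A @ B ~= W.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]  # absorb singular values into the left factor
    B = Vt[:rank, :]
    return A, B

# Hypothetical WaveNet dilated-conv weight: 256 output channels,
# 256 input channels, kernel size 2 -> 256 * 512 = 131072 parameters.
rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256 * 2))

A, B = factorize_conv_weight(W, rank=32)
print(W.size)            # 131072 parameters before factorization
print(A.size + B.size)   # 24576 parameters after (256*32 + 32*512)
```

Replacing the single layer with the two factored layers cuts the parameter count (and hence the effective data requirement) roughly in proportion to the chosen rank; at full rank the factorization reconstructs the original weight exactly.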
