Neural Vocoding for Singing and Speaking Voices with the Multi-Band Excited WaveNet

Axel Roebel, Frederik Bous

doi:10.3390/info13030103

Abstract

The use of the mel spectrogram as a signal parameterization for voice generation is quite recent and linked to the development of neural vocoders. These are deep neural networks that reconstruct high-quality speech from a given mel spectrogram. While initially developed for speech synthesis, neural vocoders have now also been studied in the context of voice attribute manipulation, opening new means for voice processing in audio production. However, to apply neural vocoders in real-world applications, two problems need to be addressed: (1) to support use in professional audio workstations, the computational complexity should be small; (2) the vocoder needs to support a large variety of speakers, differences in voice qualities, and the wide range of intensities potentially encountered during audio production. In this context, the present study provides a detailed description of the Multi-Band Excited WaveNet, a fully convolutional neural vocoder built around signal processing blocks. It evaluates the performance of the vocoder when trained on a variety of multi-speaker and multi-singer databases, including an experimental evaluation of the neural vocoder trained on speech and singing voices. Addressing the problem of intensity variation, the study introduces a new adaptive signal normalization scheme that allows for robust compensation of dynamic and static gain variations. Evaluations are performed using objective measures and a number of perceptual tests that include different neural vocoder algorithms known from the literature. The results confirm that the proposed vocoder compares favorably to the state of the art in its capacity to generalize to unseen voices and voice qualities. The remaining challenges are discussed.

Highlights

A vocoder is a parametric model of speech or singing voice signals that allows reproduction of a speech signal from parameters following an analysis/synthesis procedure.
In Ref. [41], we presented our first results concerning the Multi-Band Excited WaveNet (MBExWN), a neural vocoder performing a perceptually nearly transparent analysis/resynthesis for seen and unseen voice identities, as well as seen and unseen voice qualities.
The present paper introduces the following innovations compared to Ref. [41]: automatic and adaptive signal normalization, since a universal neural vocoder should work independently of the signal energy.

Summary

Introduction

A vocoder is a parametric model of speech or singing voice signals that allows reproduction of a speech signal from parameters following an analysis/synthesis procedure. In the case of voice synthesis, the analysis procedure may be replaced by a generator that directly produces the vocoder parameters for synthesis. A large number of vocoders employing various techniques have been proposed [5,6,7,8,9,10,11]. Most of these systems rely, in one form or another, on the source-filter model of voice production [12,13]. While the precise and robust estimation of these parameters was already a challenging research problem, the formulation of voice models that represent the relevant interactions remained elusive.
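
To make the source-filter idea mentioned above concrete, the following is a minimal sketch in plain Python. It is not the MBExWN and not taken from the paper: it only illustrates the classical principle of an excitation (source) shaped by resonances (filter). The function names, the impulse-train excitation, the two-pole resonators, and all parameter values are illustrative assumptions.

```python
import math


def pulse_train(f0_hz, sample_rate, num_samples):
    """Source: one unit impulse per pitch period (idealized voiced excitation)."""
    period = round(sample_rate / f0_hz)
    return [1.0 if n % period == 0 else 0.0 for n in range(num_samples)]


def resonator(signal, center_hz, bandwidth_hz, sample_rate):
    """Filter: two-pole resonator standing in for one vocal tract formant."""
    # Pole radius (< 1, so the filter is stable) and pole angle.
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)
    a1 = -2.0 * r * math.cos(2.0 * math.pi * center_hz / sample_rate)
    a2 = r * r
    out = []
    y1 = y2 = 0.0
    for x in signal:
        # Difference equation: y[n] = x[n] - a1*y[n-1] - a2*y[n-2]
        y = x - a1 * y1 - a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synthesize(f0_hz=100.0,
               formants=((700.0, 80.0), (1200.0, 90.0)),  # hypothetical values
               sample_rate=16000,
               num_samples=1600):
    """Pass the source through a cascade of formant resonators."""
    signal = pulse_train(f0_hz, sample_rate, num_samples)
    for center_hz, bandwidth_hz in formants:
        signal = resonator(signal, center_hz, bandwidth_hz, sample_rate)
    return signal


if __name__ == "__main__":
    samples = synthesize()
    print(len(samples), "samples synthesized")
```

In this toy setting the vocoder parameters are the fundamental frequency and the formant frequencies and bandwidths. Analysis would estimate them from a recording, and synthesis regenerates a signal from them. Real voices involve interactions between source and filter that such a simple cascade does not capture.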


Journal: Information	Publication Date: Feb 23, 2022
Citations: 5	License type: CC BY 4.0


Similar Papers

Emotional Speech Synthesis for Multi-Speaker Emotional Dataset Using WaveNet Vocoder
Heejin Choi ... Sangjun Park
01 Jan 2019

Unified Source-Filter GAN: Unified Source-Filter Network Based On Factorization of Quasi-Periodic Parallel WaveGAN
Reo Yoneyama ... Tomoki Toda
30 Aug 2021

Wavefit: an Iterative and Non-Autoregressive Neural Vocoder Based on Fixed-Point Iteration
Yuma Koizumi ... Kohei Yatabe
09 Jan 2023

Text-to-Speech Using Linear Spectrograms for Improved Sound Quality and Speed
Hyebin Yoon ... Hosung Nam
Phonetics and Speech Sciences | VOL. 13
01 Sep 2021
