Abstract

Decent vibratos are a trait of good vocal training, often associated with the perceived level of singing skill. In this paper we present a system for multi-singer singing voice synthesis that produces high-quality singing with convincing, controllable vibratos, and that can also synthesize natural singing voices for target speakers with only speech data. This is enabled by a unified speech-and-singing acoustic model that not only bridges the modality gap but also makes the best use of both types of data. The acoustic model exposes the full F0 contour, thereby allowing explicit modelling of F0 characteristics specific to the singing voice. We observe that the short-time Fourier transform of the F0 contour sparsely encodes vibrato characteristics, and derive a learning objective from this observation for improved vibrato production. Control of the synthesized vibrato extent is made possible by wiring in a supervised “extent” neuron and exposing it to the outer system. Experimental results confirm the effectiveness of the proposed objective in producing good vibratos and improving the overall perceived quality of the singing voice.
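The observation that the Fourier spectrum of the F0 contour sparsely encodes vibrato can be illustrated with a small sketch (this is not the paper's code; the frame rate, pitch, and vibrato parameters are hypothetical, and a single 1 s analysis window stands in for one STFT frame): a vibrato contour is a base pitch plus a slow sinusoidal modulation, so the spectrum of the mean-removed contour concentrates nearly all its energy near the vibrato rate.

```python
import numpy as np

# Hypothetical vibrato F0 contour: base pitch plus sinusoidal modulation.
frame_rate = 200.0   # F0 frames per second (assumed analysis rate)
t = np.arange(0, 1.0, 1.0 / frame_rate)   # one 1 s analysis window
base_f0 = 220.0         # sung pitch in Hz
vibrato_rate = 5.5      # modulation rate in Hz, typical vibratos are ~5-7 Hz
vibrato_extent = 6.0    # peak F0 deviation in Hz (the "extent")
f0 = base_f0 + vibrato_extent * np.sin(2 * np.pi * vibrato_rate * t)

# Magnitude spectrum of the mean-removed contour: the representation is
# sparse, with the dominant bin sitting at (within one bin of) the
# vibrato rate, and its height scaling with the vibrato extent.
spectrum = np.abs(np.fft.rfft(f0 - f0.mean()))
freqs = np.fft.rfftfreq(len(f0), d=1.0 / frame_rate)
peak_hz = freqs[np.argmax(spectrum)]
print(peak_hz)  # within one 1 Hz bin of the 5.5 Hz vibrato rate
```

A learning objective defined on this spectral representation can therefore target the vibrato rate and extent directly, rather than penalizing pointwise F0 error alone.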
