Abstract

Decent vibratos are a trait of good vocal training, often associated with the perceived level of singing skill. In this paper we present a system for multi-singer singing voice synthesis that produces high-quality singing with convincing, controllable vibratos, and that can also synthesize natural singing voices for target speakers with only speech data. This is enabled by a unified speech-and-singing acoustic model that not only bridges the modality gap but also makes the best use of both types of data. The acoustic model exposes the full F0 contour, thereby allowing explicit modelling of F0 characteristics specific to the singing voice. We observe that the short-time Fourier transform of the F0 contour sparsely encodes vibrato characteristics, and derive a learning objective from this observation for improved vibrato production. Control of the synthesized vibrato extent is made possible by wiring in a supervised “extent” neuron and exposing it to the outer system. Experimental results confirm the effectiveness of the proposed objective in producing good vibratos and improving the overall perceived quality of the singing voice.
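The observation that the Fourier spectrum of the F0 contour sparsely encodes vibrato can be illustrated with a small sketch (this is not the paper's code; the frame rate, pitch, and vibrato parameters are hypothetical, and a single 1 s analysis window stands in for one STFT frame): a vibrato contour is a base pitch plus a slow sinusoidal modulation, so the spectrum of the mean-removed contour concentrates nearly all its energy near the vibrato rate.

```python
import numpy as np

# Hypothetical vibrato F0 contour: base pitch plus sinusoidal modulation.
frame_rate = 200.0   # F0 frames per second (assumed analysis rate)
t = np.arange(0, 1.0, 1.0 / frame_rate)   # one 1 s analysis window
base_f0 = 220.0         # sung pitch in Hz
vibrato_rate = 5.5      # modulation rate in Hz, typical vibratos are ~5-7 Hz
vibrato_extent = 6.0    # peak F0 deviation in Hz (the "extent")
f0 = base_f0 + vibrato_extent * np.sin(2 * np.pi * vibrato_rate * t)

# Magnitude spectrum of the mean-removed contour: the representation is
# sparse, with the dominant bin sitting at (within one bin of) the
# vibrato rate, and its height scaling with the vibrato extent.
spectrum = np.abs(np.fft.rfft(f0 - f0.mean()))
freqs = np.fft.rfftfreq(len(f0), d=1.0 / frame_rate)
peak_hz = freqs[np.argmax(spectrum)]
print(peak_hz)  # within one 1 Hz bin of the 5.5 Hz vibrato rate
```

A learning objective defined on this spectral representation can therefore target the vibrato rate and extent directly, rather than penalizing pointwise F0 error alone.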
