Abstract

In this publication, we present a deep learning-based method to transform the f0 in speech and singing voice recordings. The f0 transformation is performed by training an auto-encoder on the voice signal’s mel-spectrogram and conditioning the auto-encoder on the f0. Inspired by AutoVC/F0, we apply an information bottleneck to the auto-encoder to disentangle the f0 from its latent code. The resulting model successfully applies the desired f0 to the input mel-spectrograms and adapts the speaker identity when necessary, e.g., if the requested f0 falls outside the range of the source speaker/singer. Using the mean f0 error in the transformed mel-spectrograms, we define a disentanglement measure and perform a study over the required bottleneck size. The study reveals that, to remove the f0 from the auto-encoder’s latent code, the bottleneck size should be smaller than four for singing and smaller than nine for speech. Through a perceptual test, we compare the audio quality of the proposed auto-encoder to f0 transformations obtained with a classical vocoder. The test confirms that the audio quality is better for the auto-encoder than for the classical vocoder. Finally, we carry out a visual analysis of the latent code for the two-dimensional case. We observe that the auto-encoder encodes phonemes as repeated discontinuous temporal gestures within the latent code.
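The disentanglement measure mentioned in the abstract builds on a mean f0 error between the requested f0 and the f0 found in the transformed output. The abstract does not state the exact definition, so the following is only a minimal sketch under assumptions: the error is taken as the mean absolute deviation in cents, computed over frames that are voiced in both contours, with unvoiced frames marked by an f0 of zero. The function name `mean_f0_error_cents` and the voicing convention are hypothetical and not taken from the paper.

```python
import math


def mean_f0_error_cents(target_f0, measured_f0):
    """Mean absolute f0 deviation in cents over frames voiced in both contours.

    Assumption (not from the paper): unvoiced frames are encoded as f0 <= 0
    and are skipped.
    """
    if len(target_f0) != len(measured_f0):
        raise ValueError("contours must have the same number of frames")
    errors = [
        abs(1200.0 * math.log2(measured / target))
        for target, measured in zip(target_f0, measured_f0)
        if target > 0 and measured > 0
    ]
    if not errors:
        # No frame is voiced in both contours, so the error is undefined.
        return float("nan")
    return sum(errors) / len(errors)


# Hypothetical contours in Hz; the second frame is one octave too high.
target = [220.0, 220.0, 0.0, 440.0]
measured = [220.0, 440.0, 0.0, 440.0]
error = mean_f0_error_cents(target, measured)
```

Under this reading, a small error for a requested f0 far from the source f0 would suggest that the decoder follows the conditioning signal rather than f0 information leaking through the latent code.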

Highlights

  • Since the invention of the vocoder over 80 years ago [1], people have been studying means to transform the acoustical properties of speech recordings

  • Enabling transparent f0 transformations has been a prominent topic in the speech processing research community, and attempts to achieve this have been plentiful, including the phase vocoder [2], PSOLA [3], shape-invariant additive models [4], the shape-invariant phase vocoder [5], parametric speech vocoders [6,7], and extended source-filter models [8,9]

  • We carry out a visual analysis of the auto-encoder’s latent code and find that the phonetic content is represented as periodic discontinuous patterns

Summary

Introduction

Since the invention of the vocoder over 80 years ago [1], people have been studying means to transform the acoustical properties of speech recordings. Transforming the f0 of the singing voice is crucial for singing synthesis and could help refine existing recordings. A common approach to singing synthesis is unit selection with f0 modification [22,23]. The quality of these synthesizers is limited by the quality of the transposition algorithm. As for voice modification, f0 transformation can be applied to real singing recordings to alter the intonation, expression and even melody in post-production. In speech, the f0 plays a different role than in singing: it carries important information, such as mood, intent and identity. Changing the f0 in speech can change or obfuscate the speaker’s perceived gender [26], an effect that could be used in scientific studies on how perceived gender in the voice affects social interactions [27].

Related Work
F0 Analysis
Voice Transformation on Mel-Spectrograms
Auto-Encoders
Contributions
Problem Formulation
Input Data
Network Architecture
Training Procedure
Datasets
F0 Accuracy
Synthesis Quality
Visualization of the Latent Code
Future Work