Abstract

A convolutional neural network (CNN) and a deep autoencoder (DAE) are used to predict Line Spectral Frequencies, F0, and a voiced/unvoiced flag in singing data, using as input only ultrasound images of the tongue and visual images of the lips. A novel convolutional vocoder that transforms the learned parameters into an audio signal is also presented. The Spectral Distortion of the predicted Line Spectral Frequencies is reduced compared to that of an earlier study using handcrafted features and multilayer perceptrons on the same data set, while the predicted F0 and voiced/unvoiced flags are found to be highly correlated with their ground-truth values. The convolutional vocoder is compared to standard vocoders. The results may be of interest for the study of singing articulation as well as for silent speech interface research. Sample predicted audio files are available online. Source code: https://github.com/TjuJianyu/SSI_DL.

Highlights

  • The past several years have seen a growing interest in multimodal speech processing, for combining audio tracks with video of the speaker to enhance speech recognition in noisy environments [1], [2]; to perform lip reading [3], [4]; or in Silent Speech Interface (SSI) applications [5]–[7]

  • So-called “neural” vocoders have begun to appear as an alternative to source-filter synthesizers, sometimes involving the use of Generative Adversarial Networks (GANs) [15]–[17], which are widely used in generation tasks [18]

  • Performance of the Line Spectral Frequencies (LSF) prediction is measured in dB of Spectral Distortion, defined as the root mean square of the differences in dB, at a fixed set of frequencies, between the magnitude spectra of the LPC filters derived from the original and the learned LSF values [14] (a computation sketch follows this list)

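The Spectral Distortion metric above is given only in prose. Below is a minimal NumPy/SciPy sketch of one way to compute it, assuming an even LSF order and a uniform grid of evaluation frequencies; the helper names and the grid size `n_freq` are illustrative choices, not taken from the paper.

```python
import numpy as np
from scipy.signal import freqz

def lsf_to_lpc(lsf):
    """Convert sorted Line Spectral Frequencies (radians, even order)
    back to LPC coefficients a = [1, a1, ..., ap]."""
    z = np.exp(1j * np.asarray(lsf))
    # Interleaved LSFs are the unit-circle roots of the palindromic
    # polynomials P(z) and Q(z); A(z) = (P(z) + Q(z)) / 2.
    p_poly = np.poly(np.concatenate([z[1::2], z[1::2].conj()]))
    q_poly = np.poly(np.concatenate([z[0::2], z[0::2].conj()]))
    p_poly = np.convolve(p_poly, [1.0, -1.0])  # P(z) has a fixed root at z = 1
    q_poly = np.convolve(q_poly, [1.0, 1.0])   # Q(z) has a fixed root at z = -1
    return (0.5 * (p_poly + q_poly))[:-1].real

def spectral_distortion(lsf_ref, lsf_pred, n_freq=256):
    """RMS difference in dB, over a fixed frequency grid, between the
    LPC magnitude spectra implied by the reference and predicted LSFs."""
    _, h_ref = freqz([1.0], lsf_to_lpc(lsf_ref), worN=n_freq)
    _, h_pred = freqz([1.0], lsf_to_lpc(lsf_pred), worN=n_freq)
    diff_db = 20.0 * np.log10(np.abs(h_ref)) - 20.0 * np.log10(np.abs(h_pred))
    return np.sqrt(np.mean(diff_db ** 2))

# Toy check: identical LSFs give 0 dB; a small perturbation, a small value.
lsf = np.linspace(0.2, 3.0, 12)
print(spectral_distortion(lsf, lsf))         # 0.0
print(spectral_distortion(lsf, 1.01 * lsf))  # small positive distortion
```
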
Introduction

The past several years have seen a growing interest in multimodal speech processing, for combining audio tracks with video of the speaker to enhance speech recognition in noisy environments [1], [2]; to perform lip reading [3], [4]; or in Silent Speech Interface (SSI) applications [5]–[7]. Speech processing experiments were carried out to create the large data sets of high-quality acoustic parameters necessary for training and parameter-tuning of GAN-based and other neural vocoders. In the proposed method (Figure 1), a CNN/DAE architecture first learns the LSF, F0, and voiced/unvoiced (U/V) flag from ultrasound tongue and visual lip images, using ground-truth values derived from the audio track.
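Since no details of the network survive in this excerpt, the following PyTorch sketch only illustrates the general shape of such a two-branch, multi-task CNN; it omits the autoencoder stage, and every layer size, the LSF order of 12, the image resolutions, and the class name are assumptions rather than the paper's actual CNN/DAE configuration.

```python
import torch
import torch.nn as nn

class UltrasoundLipNet(nn.Module):
    """Hypothetical two-branch CNN mapping an ultrasound tongue image and a
    lip image to an LSF vector, an F0 value, and a voicing probability."""

    def __init__(self, n_lsf=12):
        super().__init__()

        def branch():
            return nn.Sequential(
                nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(4), nn.Flatten())  # -> 32 * 4 * 4 = 512

        self.tongue = branch()
        self.lips = branch()
        self.shared = nn.Sequential(nn.Linear(1024, 256), nn.ReLU())
        self.lsf_head = nn.Linear(256, n_lsf)  # regression: LSF vector
        self.f0_head = nn.Linear(256, 1)       # regression: fundamental frequency
        self.uv_head = nn.Linear(256, 1)       # classification: voiced flag

    def forward(self, tongue_img, lip_img):
        h = self.shared(torch.cat([self.tongue(tongue_img),
                                   self.lips(lip_img)], dim=1))
        return self.lsf_head(h), self.f0_head(h), torch.sigmoid(self.uv_head(h))

# A batch of 8 frame pairs with placeholder image sizes:
net = UltrasoundLipNet()
lsf, f0, uv = net(torch.randn(8, 1, 64, 128), torch.randn(8, 1, 64, 64))
```

In a real pipeline the three heads would be trained jointly, for example with MSE losses on the LSF and F0 outputs plus a binary cross-entropy loss on the U/V flag, against the ground-truth values extracted from the audio track.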
