Abstract

The goal of voice conversion is to modify a source speaker's speech to sound as if spoken by a target speaker. Common conversion methods are based on Gaussian mixture modeling (GMM). They aim to statistically model the spectral structure of the source and target signals and require relatively large training sets (typically dozens of sentences) to avoid over-fitting. Moreover, they often lead to muffled synthesized output signals, due to excessive smoothing of the spectral envelopes. Mobile applications are characterized by low resources in terms of training data, memory footprint, and computational complexity. As technology advances, computational and memory requirements become less limiting; however, the amount of available training data still presents a great challenge, as a typical mobile user is willing to record himself saying just a few sentences. In this paper, we propose the grid-based (GB) conversion method for such low-resource environments, which is successfully trained using very few sentences (5–10). The GB approach is based on sequential Bayesian tracking, by which the conversion process is expressed as a sequential estimation problem of tracking the target spectrum based on the observed source spectrum. The converted Mel frequency cepstrum coefficient (MFCC) vectors are sequentially evaluated as a weighted sum of the target training vectors, which serve as grid points. The training process involves simple computations of Euclidean distances between the training vectors and is easily performed even with very small training sets. We use global variance (GV) enhancement to improve the perceived quality of the synthesized signals obtained by both the proposed and the GMM-based methods. Using just 10 training sentences, our enhanced GB method produces converted sentences whose GV values are closer to those of the target while simultaneously attaining lower spectral distances, compared to an enhanced version of the GMM-based conversion method.
Furthermore, subjective evaluations show that signals produced by the enhanced GB method are perceived as more similar to the target speaker than the enhanced GMM signals, at the expense of a small degradation in the perceived quality.
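The core conversion step described above, in which each converted MFCC vector is a weighted sum of target training vectors used as grid points, can be sketched as follows. This is a simplified illustration only: the weights here come from a Gaussian kernel over Euclidean distances between the source frame and the source training vectors, whereas the paper's actual weights are obtained via sequential Bayesian tracking. The function name `gb_convert` and the kernel width `sigma` are illustrative assumptions, not part of the paper.

```python
import numpy as np

def gb_convert(source_mfcc, source_grid, target_grid, sigma=1.0):
    """Sketch of grid-based conversion (illustrative, not the paper's
    exact algorithm): each source MFCC frame is mapped to a convex
    combination of target training vectors (grid points), with weights
    derived from Euclidean distances to the source training vectors."""
    converted = np.empty((source_mfcc.shape[0], target_grid.shape[1]))
    for t, frame in enumerate(source_mfcc):
        # Squared Euclidean distance from this frame to every source grid point
        d2 = np.sum((source_grid - frame) ** 2, axis=1)
        # Gaussian kernel weights (an assumption; the paper uses Bayesian tracking)
        w = np.exp(-d2 / (2.0 * sigma ** 2))
        w /= w.sum()                      # normalize so weights sum to 1
        converted[t] = w @ target_grid    # weighted sum of target grid points
    return converted
```

Because the weights are normalized, every converted frame lies in the convex hull of the target grid points, which is what keeps the method well-behaved even with very small training sets.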

Highlights

  • Voice conversion systems aim to modify the perceived identity of a source speaker saying a sentence to that of a given target speaker

  • To further improve the quality, we applied a global variance (GV) enhancement post-processing block. We recently proposed this GV enhancement approach and examined its effect on signals converted by a classical Gaussian mixture model (GMM) conversion method [21]

  • We address two related methods: (1) a method based on state-space representation [17] and (2) an exemplar-based approach [18], where the converted spectrum is evaluated as a weighted sum of the target training vectors
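The GV enhancement mentioned in the highlights above counteracts the over-smoothing that muffles converted speech. A common form of such enhancement rescales each cepstral dimension of the converted utterance so that its per-utterance variance matches the target speaker's global variance; the sketch below shows this variance-scaling form, which is an assumption here, since the paper's exact post-processing (see [21]) may differ in detail.

```python
import numpy as np

def gv_enhance(mfcc, target_gv, eps=1e-8):
    """Sketch of global variance (GV) enhancement (common variance-scaling
    form, assumed here): expand each cepstral dimension's deviations from
    the utterance mean so its variance matches the target's global variance."""
    mean = mfcc.mean(axis=0)          # per-dimension mean over the utterance
    var = mfcc.var(axis=0)            # per-dimension variance of the converted frames
    scale = np.sqrt(target_gv / (var + eps))
    return mean + scale * (mfcc - mean)
```

The transform leaves the utterance mean untouched and only stretches the trajectories, which is why it sharpens over-smoothed spectra without shifting the overall spectral envelope.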

Introduction

Voice conversion systems aim to modify the perceived identity of a source speaker saying a sentence to that of a given target speaker. This kind of transformation is useful for personalization of text-to-speech (TTS) systems, voice restoration in cases of vocal pathology, obtaining a false identity when answering the phone (for safety reasons, for example), and for entertainment purposes such as online role-playing games. Most voice conversion methods aim to transform the spectral envelope of the source speaker into the spectral envelope of the target speaker. The classical conversion method, based on modeling the spectral structure of the speech signals using a Gaussian mixture model (GMM), is the most commonly used method to date. The conversion function is linear, trained using either least squares (LS) [1], or a joint source-
