Abstract

In voice conversion (VC), it is highly desirable to obtain transformed speech signals that are perceptually close to a target speaker's voice. To this end, the proposed VC scheme adopted a perceptually meaningful criterion that takes the human auditory system into account when measuring the distance between the converted and target voices. The conversion rules for the features associated with the spectral envelope and the pitch modification factor were jointly constructed so that the perceptual distance was minimized. This minimization problem was solved using a deep neural network (DNN) framework in which the input features and target features were derived from the source speech signals and a time-aligned version of the target speech signals, respectively. Validation tests were carried out on the CMU ARCTIC database to evaluate the effectiveness of the proposed method, especially in terms of perceptual quality. The experimental results showed that the proposed method yielded perceptually preferred results compared with independent conversion using the conventional mean-square error (MSE) criterion. The maximum improvement in perceptual evaluation of speech quality (PESQ) over the conventional VC method was 0.312.

Highlights

  • Voice conversion (VC) is a method of changing the features derived from speech signals, so that one voice is made to sound like another

  • We extended the utility of the perceptual evaluation of speech quality (PESQ) to construct VC mapping rules and evaluate the quality of the converted speech

  • The performance of four conventional VC methods was evaluated: the minimum mean square error (MMSE)-based joint Gaussian method (JGMM) [20], the maximum likelihood trajectory conversion method (JDGMM) [21], dynamic frequency warping with amplitude scaling (DFW) [22], and deep neural network (DNN)-based conversion with independent pitch scaling (MLP-ind)

Summary

Introduction

Voice conversion (VC) is a method of changing the features derived from speech signals, so that one voice is made to sound like another. Since converted speech will be listened to by a human, it is highly desirable to adopt a human auditory-based distance as the objective function to be minimized. Such a distance measure has already been used in various forms of speech processing, such as speech enhancement [25], speech recognition [26], speech coding [27], speech synthesis [28,29] and speech quality evaluation [30]. In the proposed VC method, the conversion rules for both the spectral envelope and the pitch were designed so that the perceptual difference between the converted spectra and those of the target speech is minimized. The two conversion functions, one for the spectral envelope and the other for the spectral fine structure, were connected in cascade and incrementally trained to reduce a single, shared objective function. This differs from conventional VC approaches, in which the conversion rule for each feature parameter is constructed independently by minimizing a separate distance measure.
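
The idea of training two cascaded conversion stages against one shared objective can be illustrated with a toy sketch. This is not the authors' DNN or their PESQ-derived distance: the per-bin gain (standing in for the spectral envelope mapping), the single global scale (standing in for the second stage), the frequency-weighted squared error, and all names and data below are hypothetical choices made only for illustration.

```python
import numpy as np

def weighted_distance(converted, target, weights):
    """Frequency-weighted squared error, a hypothetical stand-in for a
    perceptual distance (the paper's measure is not reproduced here)."""
    diff = converted - target
    return float(np.mean(weights * diff ** 2))

def train_cascade(source, target, weights, steps=200, lr=0.1):
    """Jointly train two cascaded stages by gradient descent on ONE loss.

    Stage 1: per-bin gain (toy envelope mapping).
    Stage 2: single global scale (toy second-stage mapping).
    """
    gain = np.ones(source.shape[1])  # stage 1 parameters
    scale = 1.0                      # stage 2 parameter
    history = []
    for _ in range(steps):
        stage1 = source * gain       # output of stage 1
        converted = stage1 * scale   # output of stage 2 (the cascade)
        history.append(weighted_distance(converted, target, weights))
        # Gradient of the shared loss with respect to the cascade output.
        grad_out = 2.0 * weights * (converted - target) / converted.size
        # Chain rule: both stages receive gradients from the same loss.
        grad_scale = float(np.sum(grad_out * stage1))
        grad_gain = np.sum(grad_out * source * scale, axis=0)
        gain -= lr * grad_gain
        scale -= lr * grad_scale
    return gain, scale, history

# Synthetic "spectra": 50 frames x 8 frequency bins (made-up data).
rng = np.random.default_rng(0)
source = rng.uniform(0.5, 1.5, size=(50, 8))
target = source * np.linspace(0.8, 1.2, 8) * 1.5
weights = np.linspace(1.0, 0.2, 8)  # assumed emphasis on lower bins

gain, scale, history = train_cascade(source, target, weights)
print(f"loss before: {history[0]:.4f}, after: {history[-1]:.4f}")
```

In an independent scheme, each stage would instead be fitted against its own separate distance; here the gradient of the single loss flows through both stages, which is the property the paragraph above emphasizes.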

The Structure of the Proposed VC Method
Perceptual Distance
Estimation of the Conversion Parameters
Experiment Setup
Determination of the Weights for Each Disturbance
Objective Evaluation
Subjective Evaluation
Conclusions
