Abstract

Voice transformation, for example from a male speaker to a female speaker, is achieved here using a two-level dynamic warping algorithm in conjunction with an artificial neural network. An outer warping process that temporally aligns blocks of speech (dynamic time warp, DTW) invokes an inner warping process that spectrally aligns the magnitude spectra (dynamic frequency warp, DFW). The mapping function produced by the inner dynamic frequency warp is used to move spectral information from a source speaker to a target speaker. Artifacts arising from this amplitude-spectrum mapping are reduced by reconstructing phase information. The information obtained by this process is used to train an artificial neural network to produce spectral warping information from spectral input data. The performance of the speech mapping is compared, using Mel-Cepstral Distortion (MCD), with previous voice transformation research, and the proposed method is shown to perform better than other methods based on their reported MCD scores.
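
As a rough illustration of the two-level structure described in the abstract, the sketch below nests a bin-level dynamic frequency warp inside a frame-level dynamic time warp. The cost functions, path constraints, and spectral representation are illustrative assumptions, not the exact formulation used in the paper.

```python
# Minimal sketch of the two-level dynamic warping idea (frame size, cost
# definitions, and path constraints here are assumptions for illustration).
import numpy as np

def dfw(src_mag, tgt_mag):
    """Inner dynamic frequency warp: align two magnitude spectra bin-by-bin.
    Returns the accumulated alignment cost and the frequency-warping path
    as (source_bin, target_bin) pairs."""
    n, m = len(src_mag), len(tgt_mag)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(src_mag[i - 1] - tgt_mag[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack to recover the frequency-warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return D[n, m], path[::-1]

def two_level_dw(src_spec, tgt_spec):
    """Outer dynamic time warp over frames; the local frame-to-frame cost is
    the inner DFW cost between the two frames' magnitude spectra."""
    N, M = src_spec.shape[0], tgt_spec.shape[0]
    cost = np.empty((N, M))
    for a in range(N):
        for b in range(M):
            cost[a, b], _ = dfw(src_spec[a], tgt_spec[b])
    # Standard DTW accumulation over the frame-level cost matrix.
    D = np.full((N + 1, M + 1), np.inf)
    D[0, 0] = 0.0
    for a in range(1, N + 1):
        for b in range(1, M + 1):
            D[a, b] = cost[a - 1, b - 1] + min(D[a - 1, b], D[a, b - 1], D[a - 1, b - 1])
    return D[N, M]
```

Only the outer accumulated cost is returned here to keep the sketch short; in the method described above it is the inner warp's frequency-mapping paths along the optimal frame alignment that supply the spectral mapping data.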

Highlights

  • Voice transformation (VT) refers to the process of changing speech so that speech uttered by one speaker sounds as if another speaker had spoken it, for example, transforming from a male voice to a female voice [1]

  • We found that the quality of the speech was significantly improved by using a phase reconstruction algorithm on the warped speech

  • The audio file sounded significantly better, which verified both that Mel-Cepstral Distortion (MCD) is a valid measure of audio quality and that phase reconstruction is an important part of the VT (a sketch of the MCD metric follows)
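
A minimal sketch of the MCD metric referred to above, assuming mel-cepstral coefficient sequences have already been extracted and time-aligned; the coefficient order and the treatment of the 0th coefficient vary between studies.

```python
# Hedged sketch of frame-level Mel-Cepstral Distortion (MCD) in dB, assuming
# aligned (frames x coefficients) mel-cepstral arrays and skipping the 0th
# (energy) coefficient.
import numpy as np

def mel_cepstral_distortion(target_mcep, converted_mcep):
    diff = target_mcep[:, 1:] - converted_mcep[:, 1:]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))
```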

Summary

Introduction

Voice transformation (VT) refers to the process of changing speech so that speech uttered by one speaker (the source speaker) sounds as if another speaker (the target speaker) had spoken it, for example, transforming a male voice into a female voice [1]. In a training stage, performed offline, features are extracted from both the source and target speech. The VT system, which maps information from the source speech to the target speech, is trained to form conversion rules; these conversion rules are then used to convert features from the source to the target. Spectral features computed using the DFT are transformed using a neural network (NN). The speech signals are aligned using dynamic warping (DW), which aligns them both temporally and spectrally to produce the training data for the NN. The NN is trained to produce the spectral mapping functions determined by the frequency-domain warping.
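
The training step described above can be pictured as follows. This is a hedged sketch in which the layer sizes, activations, loss, and optimiser settings are assumptions for illustration, not the configuration reported in the paper.

```python
# Sketch of training an NN to predict the DFW-derived spectral warping
# function from a source magnitude spectrum (DFT features). All hyperparameters
# below are assumed values for illustration.
import torch
import torch.nn as nn

N_BINS = 257  # e.g. one side of a 512-point DFT; assumed value

warp_net = nn.Sequential(
    nn.Linear(N_BINS, 512),
    nn.Tanh(),
    nn.Linear(512, 512),
    nn.Tanh(),
    nn.Linear(512, N_BINS),  # predicted target bin position for each source bin
)

optimiser = torch.optim.Adam(warp_net.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(src_mag, dfw_warp):
    """One update: src_mag and dfw_warp are (batch, N_BINS) tensors, where
    dfw_warp holds the warping function produced by the DFW stage."""
    optimiser.zero_grad()
    pred = warp_net(torch.log1p(src_mag))  # log-compressed magnitudes (assumed)
    loss = loss_fn(pred, dfw_warp)
    loss.backward()
    optimiser.step()
    return loss.item()
```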

Previous Voice Transformation Research
Two-Level Dynamic Warping
Experiment Description and a Proof of Concept Result
Spectral Warping Using NN
Mel-Cepstral Distortion as an Objective Measure
NN Architecture Experiments
Phase One
Phase Two
Phase Three
Other NN Structures
Comparison
Findings
Summary and Conclusions