Abstract

Voice transformation, for example from a male speaker to a female speaker, is achieved here using a two-level dynamic warping algorithm in conjunction with an artificial neural network. An outer warping process that temporally aligns blocks of speech (dynamic time warp, DTW) invokes an inner warping process that spectrally aligns the magnitude spectra (dynamic frequency warp, DFW). The mapping function produced by the inner dynamic frequency warp is used to move spectral information from a source speaker to a target speaker. Artifacts arising from this amplitude-spectrum mapping are reduced by reconstructing phase information. The information obtained by this process is used to train an artificial neural network to produce spectral warping information from spectral input data. The performance of the speech mapping is compared, using Mel-Cepstral Distortion (MCD), with previous voice transformation research, and the proposed method is shown to perform better than other methods based on their reported MCD scores.
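
As a rough illustration of the two-level structure described in the abstract, the sketch below nests a bin-level dynamic frequency warp inside a frame-level dynamic time warp. The cost functions, path constraints, and spectral representation are illustrative assumptions, not the exact formulation used in the paper.

```python
# Minimal sketch of the two-level dynamic warping idea (frame size, cost
# definitions, and path constraints here are assumptions for illustration).
import numpy as np

def dfw(src_mag, tgt_mag):
    """Inner dynamic frequency warp: align two magnitude spectra bin-by-bin.
    Returns the accumulated alignment cost and the frequency-warping path
    as (source_bin, target_bin) pairs."""
    n, m = len(src_mag), len(tgt_mag)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(src_mag[i - 1] - tgt_mag[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack to recover the frequency-warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return D[n, m], path[::-1]

def two_level_dw(src_spec, tgt_spec):
    """Outer dynamic time warp over frames; the local frame-to-frame cost is
    the inner DFW cost between the two frames' magnitude spectra."""
    N, M = src_spec.shape[0], tgt_spec.shape[0]
    cost = np.empty((N, M))
    for a in range(N):
        for b in range(M):
            cost[a, b], _ = dfw(src_spec[a], tgt_spec[b])
    # Standard DTW accumulation over the frame-level cost matrix.
    D = np.full((N + 1, M + 1), np.inf)
    D[0, 0] = 0.0
    for a in range(1, N + 1):
        for b in range(1, M + 1):
            D[a, b] = cost[a - 1, b - 1] + min(D[a - 1, b], D[a, b - 1], D[a - 1, b - 1])
    return D[N, M]
```

Only the outer accumulated cost is returned here to keep the sketch short; in the method described above it is the inner warp's frequency-mapping paths along the optimal frame alignment that supply the spectral mapping data.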

Highlights

  • Voice transformation (VT) refers to the process of changing speech so that speech uttered by one speaker sounds as if another speaker had spoken it, for example, transforming from a male voice to a female voice [1]

  • We found that the quality of the speech was significantly improved by using a phase reconstruction algorithm on the warped speech

  • The audio file sounded significantly better, which verified both that Mel-Cepstral Distortion (MCD) is a valid measure of audio quality and that phase reconstruction is an important part of the VT (a sketch of the MCD metric follows)
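
A minimal sketch of the MCD metric referred to above, assuming mel-cepstral coefficient sequences have already been extracted and time-aligned; the coefficient order and the treatment of the 0th coefficient vary between studies.

```python
# Hedged sketch of frame-level Mel-Cepstral Distortion (MCD) in dB, assuming
# aligned (frames x coefficients) mel-cepstral arrays and skipping the 0th
# (energy) coefficient.
import numpy as np

def mel_cepstral_distortion(target_mcep, converted_mcep):
    diff = target_mcep[:, 1:] - converted_mcep[:, 1:]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))
```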

Summary

Introduction

Voice transformation (VT) refers to the process of changing speech so that speech uttered by one speaker (the source speaker) sounds as if another speaker (the target speaker) had spoken it, for example, transforming a male voice into a female voice [1]. In a training stage, performed offline, features are extracted from both the source and target speech. The VT system, which maps information from the source speech to the target speech, is trained to form conversion rules; these conversion rules are then used to convert features from the source to the target. Spectral features computed using the DFT are transformed using a neural network (NN). The speech signals are aligned using dynamic warping (DW), which aligns them both temporally and spectrally to produce the training data for the NN. The NN is trained to produce the spectral mapping functions determined by the frequency-domain warping.
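
The training step described above can be pictured as follows. This is a hedged sketch in which the layer sizes, activations, loss, and optimiser settings are assumptions for illustration, not the configuration reported in the paper.

```python
# Sketch of training an NN to predict the DFW-derived spectral warping
# function from a source magnitude spectrum (DFT features). All hyperparameters
# below are assumed values for illustration.
import torch
import torch.nn as nn

N_BINS = 257  # e.g. one side of a 512-point DFT; assumed value

warp_net = nn.Sequential(
    nn.Linear(N_BINS, 512),
    nn.Tanh(),
    nn.Linear(512, 512),
    nn.Tanh(),
    nn.Linear(512, N_BINS),  # predicted target bin position for each source bin
)

optimiser = torch.optim.Adam(warp_net.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(src_mag, dfw_warp):
    """One update: src_mag and dfw_warp are (batch, N_BINS) tensors, where
    dfw_warp holds the warping function produced by the DFW stage."""
    optimiser.zero_grad()
    pred = warp_net(torch.log1p(src_mag))  # log-compressed magnitudes (assumed)
    loss = loss_fn(pred, dfw_warp)
    loss.backward()
    optimiser.step()
    return loss.item()
```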

Previous Voice Transformation Research
Two-Level Dynamic Warping
Experiment Description and a Proof of Concept Result
Spectral Warping Using NN
Mel-Cepstral Distortion as an Objective Measure
NN Architecture Experiments
Phase One
Phase Two
Phase Three
Other NN Structures
Comparison
Findings
Summary and Conclusions