Unsupervised Representation Disentanglement Using Cross Domain Features and Adversarial Learning in Variational Autoencoder Based Voice Conversion

Wen-Chin Huang, Hsin-Min Wang, Yu Tsao, Chen-Chou Lo, Hao Luo, Yu-Huai Peng, Hsin-Te Hwang

doi:10.1109/tetci.2020.2977678

Abstract

An effective approach for voice conversion (VC) is to disentangle linguistic content from other components in the speech signal. The effectiveness of variational autoencoder (VAE) based VC (VAE-VC), for instance, strongly relies on this principle. In our prior work, we proposed a cross-domain VAE-VC (CDVAE-VC) framework, which utilized acoustic features of different properties, to improve the performance of VAE-VC. We believed that the success came from more disentangled latent representations. In this paper, we extend the CDVAE-VC framework by incorporating the concept of adversarial learning, in order to further increase the degree of disentanglement, thereby improving the quality and similarity of converted speech. More specifically, we first investigate the effectiveness of incorporating the generative adversarial networks (GANs) with CDVAE-VC. Then, we consider the concept of domain adversarial training and add an explicit constraint to the latent representation, realized by a speaker classifier, to explicitly eliminate the speaker information that resides in the latent code. Experimental results confirm that the degree of disentanglement of the learned latent representation can be enhanced by both GANs and the speaker classifier. Meanwhile, subjective evaluation results in terms of quality and similarity scores demonstrate the effectiveness of our proposed methods.
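
The speaker-classifier constraint described in the abstract can be illustrated with a generic domain-adversarial objective. This is only a sketch under assumed notation, not the paper's exact formulation: the encoder $E_\phi$, the speaker classifier $C_\psi$, the weight $\lambda$, and the term $\mathcal{L}_{\mathrm{VAE}}$ are symbols introduced here for illustration.

```latex
% Assumed notation: x = input spectral frame, y = speaker label,
% E_\phi = encoder, C_\psi = speaker classifier on the latent code.
\mathcal{L}_{\mathrm{cls}}(\phi, \psi)
  = -\,\mathbb{E}_{(x, y)}\bigl[\log C_\psi\bigl(y \mid E_\phi(x)\bigr)\bigr]

% The classifier is trained to identify the speaker from the latent code:
\min_{\psi}\; \mathcal{L}_{\mathrm{cls}}(\phi, \psi)

% The encoder is trained to reconstruct well while defeating the classifier,
% which pushes speaker information out of the latent code:
\min_{\phi}\; \mathcal{L}_{\mathrm{VAE}}(\phi) - \lambda\, \mathcal{L}_{\mathrm{cls}}(\phi, \psi)
```

Under this reading, a latent code from which the classifier cannot recover the speaker is, by construction, less entangled with speaker identity.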

Highlights

Voice conversion (VC) aims to convert the speech from a source to that of a target without changing the linguistic content [1].
The motivations of CDVAE-VC are: (1) although the effectiveness of variational autoencoder (VAE)-VC using vocoder spectra has been confirmed, the use of other types of spectral features, such as mel-cepstral coefficients (MCCs) [56], which are related to human perception and have been widely used in VC, has not been properly investigated; (2) since modeling low- and high-dimensional features alone each has its own shortcomings, it is believed, based on multitarget/task learning [57], [58], that a model capable of simultaneously modeling two types of spectral features can yield better performance even if they are from the same feature domain.
We extend the CDVAE-VC framework by incorporating the concept of adversarial training to improve the degree of disentanglement as well as the conversion performance.

Summary

INTRODUCTION

Voice conversion (VC) aims to convert the speech from a source to that of a target without changing the linguistic content [1]. The motivations of CDVAE-VC are: (1) although the effectiveness of VAE-VC using vocoder spectra (e.g., the STRAIGHT spectra, SPs [55]) has been confirmed, the use of other types of spectral features, such as mel-cepstral coefficients (MCCs) [56], which are related to human perception and have been widely used in VC, has not been properly investigated; (2) since modeling low- and high-dimensional features alone each has its own shortcomings, it is believed, based on multitarget/task learning [57], [58], that a model capable of simultaneously modeling two types of spectral features can yield better performance even if they are from the same feature domain. To this end, CDVAE-VC [54] extended the VAE-VC framework to jointly consider two kinds of spectral features, namely SPs and MCCs. By introducing two additional cross-domain reconstruction losses and a latent similarity constraint into the training objective, the latent representations encoded from the input SPs and MCCs are encouraged to be similar to each other and become capable of self- or cross-reconstructing the input features.
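
The structure of that training objective (two within-domain reconstructions, two cross-domain reconstructions, and a latent similarity term) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the linear encoders and decoders, the feature sizes, the mean-squared-error reconstruction term, and the L1 latent distance are all assumptions made here, and the sampling step and KL term of a real VAE are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only (assumed, not taken from the text above).
SP_DIM, MCC_DIM, LATENT_DIM, N_SPEAKERS = 513, 35, 16, 4


def linear(in_dim, out_dim):
    """Hypothetical stand-in for a neural network: one random linear map."""
    return rng.normal(scale=0.01, size=(in_dim, out_dim))


# One encoder and one decoder per feature domain.
enc = {"sp": linear(SP_DIM, LATENT_DIM), "mcc": linear(MCC_DIM, LATENT_DIM)}
dec = {
    "sp": linear(LATENT_DIM + N_SPEAKERS, SP_DIM),
    "mcc": linear(LATENT_DIM + N_SPEAKERS, MCC_DIM),
}


def cdvae_losses(feats, speaker_onehot):
    """Return the four reconstruction losses and the latent similarity loss."""
    # Encode each feature type into its own latent code.
    z = {name: x @ enc[name] for name, x in feats.items()}
    losses = {}
    for src in feats:          # which latent code is decoded
        for tgt in feats:      # which feature type is reconstructed
            dec_in = np.concatenate([z[src], speaker_onehot], axis=1)
            recon = dec_in @ dec[tgt]
            # src == tgt is self-reconstruction; src != tgt is cross-domain.
            losses[f"{src}->{tgt}"] = float(np.mean((recon - feats[tgt]) ** 2))
    # Latent similarity constraint: pull the two latent codes together.
    losses["latent_sim"] = float(np.mean(np.abs(z["sp"] - z["mcc"])))
    return losses


batch = 8
feats = {
    "sp": rng.normal(size=(batch, SP_DIM)),
    "mcc": rng.normal(size=(batch, MCC_DIM)),
}
speaker = np.eye(N_SPEAKERS)[rng.integers(0, N_SPEAKERS, size=batch)]
losses = cdvae_losses(feats, speaker)
total = sum(losses.values())
```

Because every decoder is fed a latent code plus a speaker code, conversion in such a setup amounts to swapping the speaker code at decoding time, which is why a latent code free of speaker information matters.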

BACKGROUND

VAE-VC

CDVAE-VC

INCORPORATING CDVAE-VC WITH GANS

The GAN Objective in the General VAE-VC

The Classifier Loss

Experimental Settings

Objective

Applying GANs to Different Features

Effectiveness of GANs

Effectiveness of CLS

Disentanglement Measure

CONCLUSIONS

Journal: IEEE Transactions on Emerging Topics in Computational Intelligence	Publication Date: Aug 1, 2020
Citations: 41	License type: CC BY 4.0
