Speaker-adaptive-trainable Boltzmann machine and its application to non-parallel voice conversion

Toru Nakashika, Yasuhiro Minami

doi:10.1186/s13636-017-0112-6

Abstract

In this paper, we present a voice conversion (VC) method that does not use any parallel data while training the model. Voice conversion is a technique where only speaker-specific information in the source speech is converted while keeping the phonological information unchanged. Most of the existing VC methods rely on parallel data—pairs of speech data from the source and target speakers uttering the same sentences. However, the use of parallel data in training causes several problems: (1) the data used for the training is limited to the pre-defined sentences, (2) the trained model is only applied to the speaker pair used in the training, and (3) a mismatch in alignment may occur. Although it is generally preferable in VC to not use parallel data, a non-parallel approach is considered difficult to learn. In our approach, we realize the non-parallel training based on speaker-adaptive training (SAT). Speech signals are represented using a probabilistic model based on the Boltzmann machine that defines phonological information and speaker-related information explicitly. Speaker-independent (SI) and speaker-dependent (SD) parameters are simultaneously trained using SAT. In the conversion stage, a given speech signal is decomposed into phonological and speaker-related information, the speaker-related information is replaced with that of the desired speaker, and then voice-converted speech is obtained by combining the two. Our experimental results showed that our approach outperformed the conventional non-parallel approach regarding objective and subjective criteria.

Highlights

In recent years, voice conversion (VC), which is a technique used to change speaker-specific information in the speech of a source speaker into that of a target speaker while retaining linguistic information, has been garnering much attention since the VC techniques can be applied to various tasks [1,2,3,4,5].
Section 5.4, Evaluation using various speaker pairs: We investigated the performance of the proposed speaker-adaptive-trainable Boltzmann machine (SATBM) using various speaker pairs that include four gender types: male-to-female (M2F), female-to-male (F2M), male-to-male (M2M), and female-to-female (F2F).
When we compare the VC performance of different speaker pairs using the Mel-cepstral distortion (MCD), we may conclude that the model performed best when converting “ECL0001” to “MIT0001” because this conversion provided the smallest MCD.
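
The MCD mentioned above can be illustrated with a minimal sketch. It uses the commonly cited definition, MCD = (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2), averaged over frames. The function name, the assumption that frames are already time-aligned, and the assumption that the 0th (energy) coefficient has been dropped are illustrative choices, not details taken from the paper, whose exact evaluation settings are not given in this excerpt.

```python
import math


def mel_cepstral_distortion(ref_frames, conv_frames):
    """Mean Mel-cepstral distortion in dB over paired frames.

    Assumptions (hypothetical, not from the paper): both inputs are
    sequences of equal-length coefficient vectors, already time-aligned,
    with the 0th (energy) coefficient excluded by the caller.
    """
    if len(ref_frames) != len(conv_frames):
        raise ValueError("frame sequences must have the same length")
    if not ref_frames:
        raise ValueError("at least one frame is required")

    scale = 10.0 / math.log(10.0)
    total = 0.0
    for ref, conv in zip(ref_frames, conv_frames):
        if len(ref) != len(conv):
            raise ValueError("coefficient vectors must have the same length")
        # Squared Euclidean distance between the two coefficient vectors.
        squared_diff = sum((r - c) ** 2 for r, c in zip(ref, conv))
        total += scale * math.sqrt(2.0 * squared_diff)
    return total / len(ref_frames)
```

A lower value means the converted coefficients lie closer to the target speaker's, which is the sense in which the smallest MCD is read as the best conversion.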

Summary

Introduction

Voice conversion (VC), which is a technique used to change speaker-specific information in the speech of a source speaker into that of a target speaker while retaining linguistic information, has been garnering much attention since the VC techniques can be applied to various tasks [1,2,3,4,5]. Most of the existing approaches rely on statistical models [6, 7], and the approach based on the Gaussian mixture model (GMM) [8,9,10,11] is one of the mainstream methods used nowadays. Other statistical models, such as non-negative matrix factorization (NMF) [12, 13], neural networks (NNs) [14], restricted Boltzmann machines (RBMs) [15, 16], and deep learning [17, 18], are also used in VC. Multistep VC [24] has been proposed to reduce the training cost of estimating the mapping functions for each speaker pair.



Journal: EURASIP Journal on Audio, Speech, and Music Processing
Publication Date: Jun 29, 2017
Citations: 1
License type: open-access

Similar Papers

Speaker adaptive model based on Boltzmann machine for non-parallel training in voice conversion
Toru Nakashika ... Yasuhiro Minami
01 Mar 2016

Non-Parallel Training in Voice Conversion Using an Adaptive Restricted Boltzmann Machine
Toru Nakashika ... Tetsuya Takiguchi
IEEE/ACM Transactions on Audio, Speech, and Language Processing | VOL. 24
01 Nov 2016

3WRBM-based speech factor modeling for arbitrary-source and non-parallel voice conversion
Toru Nakashika ... Yasuhiro Minami
01 Aug 2016

Parallel-Data-Free Dictionary Learning for Voice Conversion Using Non-Negative Tucker Decomposition
Yuki Takashima ... Toru Nakashika
01 Apr 2018
