Abstract

A multimodal voice conversion (VC) method for noisy environments is proposed. In our previous non-negative matrix factorization (NMF)-based VC method, source and target exemplars are extracted from parallel training data, in which the same texts are uttered by the source and target speakers. The input source signal is decomposed into source exemplars, noise exemplars, and their weights, and the converted speech is then constructed from the target exemplars and the weights related to the source exemplars. In this study, we propose a multimodal VC method that improves the noise robustness of our NMF-based VC method. We also introduce a combination weight between audio and visual features and formulate a new cost function for estimating audio-visual exemplars. By using joint audio-visual features as source features, conversion performance is improved compared with that of a previous audio-input exemplar-based VC method. The effectiveness of the proposed method is confirmed by comparison with a conventional audio-input NMF-based method and a Gaussian mixture model-based method.
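
As a rough illustration of this exemplar-based pipeline, the sketch below decomposes a noisy input spectrogram over concatenated source and noise exemplars with sparse non-negative activities, then rebuilds the converted spectra from the parallel target exemplars. The function names, the KL-divergence multiplicative updates, and the sparsity value are illustrative choices, not the paper's exact formulation.

```python
# Minimal sketch of exemplar-based NMF voice conversion (hypothetical
# names; the update rule and sparsity penalty are illustrative, not the
# paper's exact settings).
import numpy as np

def estimate_activities(X, D, n_iter=200, sparsity=0.1, eps=1e-12):
    """Estimate non-negative activities H such that X ~= D @ H.

    X : (freq_bins, frames)       magnitude spectra of the noisy input
    D : (freq_bins, n_exemplars)  concatenated [source | noise] exemplars
    Multiplicative updates for the KL divergence with an L1 (sparsity) term.
    """
    H = np.random.rand(D.shape[1], X.shape[1])
    ones = np.ones_like(X)
    for _ in range(n_iter):
        H *= (D.T @ (X / (D @ H + eps))) / (D.T @ ones + sparsity + eps)
    return H

def convert(X_noisy, A_source, A_noise, A_target):
    """Decompose the noisy input over source + noise exemplars, then
    rebuild the converted spectra from the parallel target exemplars."""
    D = np.hstack([A_source, A_noise])
    H = estimate_activities(X_noisy, D)
    H_source = H[:A_source.shape[1], :]   # weights tied to source exemplars
    return A_target @ H_source            # converted (and denoised) spectra

# Toy usage with random data, just to show the shapes involved.
rng = np.random.default_rng(0)
A_src, A_noi, A_tgt = rng.random((513, 100)), rng.random((513, 20)), rng.random((513, 100))
X = rng.random((513, 50))
Y = convert(X, A_src, A_noi, A_tgt)       # (513, 50) converted spectra
```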

Highlights

  • Background noise is an unavoidable factor in speech processing

  • In automatic speech recognition (ASR) tasks, one problem is that recognition performance decreases significantly in noisy environments, which impedes the development of practical ASR applications

  • We propose a noise-robust voice conversion (VC) method that is based on sparse representations

Summary

Introduction

Background noise is an unavoidable factor in speech processing. In automatic speech recognition (ASR) tasks, one problem is that recognition performance decreases significantly in noisy environments, which impedes the development of practical ASR applications. Voice conversion faces a similar difficulty: noise in the input signal is carried over into the converted signal and degrades conversion performance by causing unexpected mapping of source features, so noise-robust VC is required for real environments. To address this problem, we propose a noise-robust VC method based on sparse representations, in which the input noisy audio-visual feature is represented by a linear combination of source and noise exemplars. We evaluate our multimodal VC using continuous digit utterances, which have been widely used in studies on audio-visual signal processing.
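
The sketch below illustrates one way such a joint audio-visual observation and exemplar dictionary could be assembled with a combination weight between the two streams; the weight value, the feature dimensions, and the zero visual rows assumed for noise exemplars are illustrative assumptions rather than the paper's settings.

```python
# A minimal sketch (not the paper's exact formulation) of building the
# joint audio-visual observation and exemplar dictionary with a
# combination weight.
import numpy as np

def stack_audio_visual(audio_feats, visual_feats, alpha=0.7):
    """Frame-wise stacking of the two streams.

    audio_feats  : (d_audio, frames)   e.g. magnitude spectra
    visual_feats : (d_visual, frames)  e.g. lip-region features
    alpha        : combination weight in [0, 1] balancing audio vs. visual
    """
    return np.vstack([alpha * audio_feats, (1.0 - alpha) * visual_feats])

# Toy shapes only, to show how the joint dictionary is assembled.
rng = np.random.default_rng(0)
d_a, d_v, n_src, n_noise, frames = 513, 50, 100, 20, 30

A_src_av = stack_audio_visual(rng.random((d_a, n_src)), rng.random((d_v, n_src)))
# Assumed convention: acoustic noise has no lip-motion counterpart, so
# the visual rows of the noise exemplars are set to zero.
A_noise_av = stack_audio_visual(rng.random((d_a, n_noise)), np.zeros((d_v, n_noise)))
D_av = np.hstack([A_src_av, A_noise_av])   # joint audio-visual dictionary
X_av = stack_audio_visual(rng.random((d_a, frames)), rng.random((d_v, frames)))
# X_av is then decomposed over D_av exactly as in the audio-only sketch above.
```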

Multimodal voice conversion
Estimation of activity from noisy source signals
Target speech construction
Experimental results
Conclusions