Abstract

This paper presents a multimodal voice conversion (VC) method for noisy environments. In our previous NMF-based VC method, source and target exemplars are extracted from parallel training data, in which the same texts are uttered by the source and target speakers. The input source signal is decomposed into source exemplars, noise exemplars obtained from the input signal, and their weights. The converted speech is then constructed from the target exemplars and the weights associated with the source exemplars. In this paper, we propose a multimodal VC method that improves the noise robustness of our NMF-based approach. By using joint audio-visual features as the source features, conversion performance improves over the previous audio-input NMF-based method. The effectiveness of the proposed method was confirmed by comparison with a conventional Gaussian Mixture Model (GMM)-based method.
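
As a rough illustration of the exemplar-based pipeline the abstract describes, the NumPy sketch below decomposes a noisy source feature matrix over a concatenated dictionary of source and noise exemplars, then reconstructs with the target exemplars using only the source-exemplar weights. The function names, variable names, and the choice of nonnegative spectral features are illustrative assumptions, not the paper's implementation; in the multimodal variant, audio and visual features would be stacked row-wise in both the input matrix and the source dictionary.

    import numpy as np

    def nmf_activations(X, D, n_iter=200, eps=1e-10):
        # Estimate nonnegative weights H such that X ~= D @ H, with the
        # dictionary D held fixed, via the standard multiplicative update
        # for the KL divergence (Lee & Seung).
        H = np.random.rand(D.shape[1], X.shape[1])
        ones = np.ones_like(X)
        for _ in range(n_iter):
            H *= (D.T @ (X / (D @ H + eps))) / (D.T @ ones + eps)
        return H

    def convert(X_noisy, A_src, A_noise, A_tgt):
        # Decompose the noisy source features over the concatenated
        # [source-exemplar | noise-exemplar] dictionary, then rebuild
        # the speech with the target exemplars and the weights of the
        # source exemplars only, discarding the noise activations.
        D = np.hstack([A_src, A_noise])
        H = nmf_activations(np.maximum(X_noisy, 0.0), D)
        H_src = H[:A_src.shape[1], :]  # weights tied to source exemplars
        return A_tgt @ H_src           # noise-suppressed converted features

Because the source and target exemplars come from parallel (same-text) data, the two dictionaries are aligned column by column, which is what lets the weights estimated against the source exemplars be reused directly with the target exemplars.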
