Abstract
A multimodal voice conversion (VC) method for noisy environments is proposed. In our previous non-negative matrix factorization (NMF)-based VC method, source and target exemplars are extracted from parallel training data, in which the source and target speakers utter the same texts. The input source signal is decomposed into source exemplars, noise exemplars, and their weights, and the converted speech is then constructed from the target exemplars and the weights related to the source exemplars. In this study, we propose multimodal VC that improves the noise robustness of our NMF-based VC method. Furthermore, we introduce a combination weight between audio and visual features and formulate a new cost function to estimate audio-visual exemplars. Using the joint audio-visual features as source features improves VC performance compared with that of a previous audio-input exemplar-based VC method. The effectiveness of the proposed method is confirmed by comparing it with a conventional audio-input NMF-based method and a Gaussian mixture model-based method.
Highlights
Background noise is an unavoidable factor in speech processing
In automatic speech recognition (ASR) tasks, one problem is that recognition performance decreases significantly in noisy environments, which impedes the development of practical ASR applications
We propose a noise-robust voice conversion (VC) method that is based on sparse representations
Summary
Background noise is an unavoidable factor in speech processing. In automatic speech recognition (ASR) tasks, recognition performance decreases significantly in noisy environments, which impedes the development of practical ASR applications. Voice conversion (VC) is affected in a similar way: noise in the input signal is output with the converted signal and degrades conversion performance because of unexpected mapping of source features, so noise-robust VC is required for real environments. To address this problem, we propose a noise-robust VC method that is based on sparse representations. The input noisy audio-visual feature is represented by a linear combination of source and noise exemplars. We evaluate our multimodal VC using continuous digital utterances, which have been used in most studies related to audio-visual signal processing.
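The exemplar-based decomposition described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a KL-divergence objective with an L1 sparsity penalty and multiplicative updates, which is a common choice for exemplar-based NMF but is not confirmed by this text, and it does not reproduce the audio-visual combination weight or the new cost function, whose details are not given here. All names (`estimate_activations`, `convert`, `A_src`, `A_tgt`, `A_noise`) and the random toy data are hypothetical. Each column of an exemplar matrix is one exemplar, and source and target exemplars are assumed to be paired column by column, as they would be when taken from parallel data.

```python
import numpy as np


def estimate_activations(X, D, n_iter=200, sparsity=0.1, eps=1e-12):
    """Estimate non-negative weights H such that X ~= D @ H, with D held fixed.

    Assumed objective: KL divergence plus an L1 penalty on H,
    minimized with multiplicative updates.
    """
    rng = np.random.default_rng(0)
    H = rng.random((D.shape[1], X.shape[1])) + eps
    col_sums = D.sum(axis=0)[:, None]  # D^T 1, one value per exemplar
    for _ in range(n_iter):
        ratio = X / (D @ H + eps)
        H *= (D.T @ ratio) / (col_sums + sparsity)
    return H


def convert(X_noisy, A_src, A_tgt, A_noise, **kwargs):
    """Decompose noisy input over [source | noise] exemplars, then rebuild
    the output from the target exemplars and the source-related weights only."""
    D = np.hstack([A_src, A_noise])
    H = estimate_activations(X_noisy, D, **kwargs)
    H_src = H[: A_src.shape[1]]  # drop the weights assigned to noise exemplars
    return A_tgt @ H_src, H


# Toy data standing in for real features (hypothetical sizes).
rng = np.random.default_rng(1)
n_bins, n_exemplars, n_noise, n_frames = 20, 30, 5, 8
A_src = rng.random((n_bins, n_exemplars))
A_tgt = rng.random((n_bins, n_exemplars))
A_noise = rng.random((n_bins, n_noise))
X_noisy = (A_src @ rng.random((n_exemplars, n_frames))
           + A_noise @ rng.random((n_noise, n_frames)))

Y, H = convert(X_noisy, A_src, A_tgt, A_noise)
print(Y.shape, H.shape)
```

Because the weights tied to the noise exemplars are discarded before the target exemplars are applied, the noise component of the input does not carry over into the converted features. In a multimodal setting, the columns of `X_noisy` and `A_src` would hold joint audio-visual feature vectors rather than audio features alone.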