Abstract

Requiring a large amount of parallel data is a major hurdle for the practical use of voice conversion (VC). This paper presents a novel exemplar-based VC framework that requires only a small number of parallel exemplars. In our previous work, we proposed a VC technique for noisy environments based on non-negative matrix factorization (NMF). That method requires parallel exemplars (source and target exemplars of the same texts uttered by the source and target speakers, respectively) to construct its dictionaries. Within the framework of conventional Gaussian mixture model (GMM)-based VC, several approaches that do not need parallel exemplars have been proposed. However, no such method has been proposed for exemplar-based VC in noisy environments. In this paper, we introduce an adaptation matrix into the NMF framework to adapt the source dictionary to the target dictionary. This adaptation matrix is estimated using only a small parallel speech corpus. We refer to this method as affine NMF, and we confirmed its effectiveness by comparing it with a conventional NMF-based method and a GMM-based method in noisy environments.
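The abstract does not spell out how the adaptation matrix is estimated, so the following is only a minimal sketch under stated assumptions, not the paper's affine NMF. We assume a non-negative matrix `A` that maps source spectra to target spectra, `Xt ≈ A @ Xs`, fitted on a small set of time-aligned parallel exemplars with Lee-Seung multiplicative updates for the Euclidean cost. The function names `estimate_adaptation_matrix` and `adapt_dictionary` are hypothetical.

```python
import numpy as np


def estimate_adaptation_matrix(Xs, Xt, n_iter=500, eps=1e-12, seed=0):
    """Fit a non-negative A with Xt ≈ A @ Xs (hypothetical sketch).

    Xs, Xt: non-negative spectra (features x frames) of a small,
    time-aligned parallel corpus. Uses multiplicative updates that keep
    A non-negative and do not increase ||Xt - A @ Xs||_F.
    """
    rng = np.random.default_rng(seed)
    A = rng.random((Xt.shape[0], Xs.shape[0])) + eps
    XsXsT = Xs @ Xs.T  # precomputed Gram matrix of source frames
    XtXsT = Xt @ Xs.T  # precomputed target/source cross term
    for _ in range(n_iter):
        A *= XtXsT / (A @ XsXsT + eps)
    return A


def adapt_dictionary(A, W_src):
    """Map a (large) source exemplar dictionary into the target space."""
    return A @ W_src
```

This sketch models only the linear part of the mapping. One way to make it affine is to append a constant row of ones to `Xs` (and to `W_src`), so that the extra column of `A` acts as a non-negative bias.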

Highlights

  • Background noise is an unavoidable factor in speech processing

  • In automatic speech recognition (ASR), recognition performance drops sharply in noisy environments, which is a significant obstacle to the practical use of ASR

  • We proposed an exemplar-based method for noise-robust voice conversion (VC) using non-negative matrix factorization (NMF)


Summary

Introduction

Background noise is an unavoidable factor in speech processing. In automatic speech recognition (ASR), for example, recognition performance drops sharply in noisy environments, which is a significant obstacle to the practical use of ASR. Exemplar-based VC using NMF has been proposed in [5]. In this framework, parallel training data is stored as source and target dictionaries. The weights of the source exemplars are estimated from the input, and the target signal is constructed from the target exemplars and those weights. This method performed better than the conventional GMM-based method in speaker conversion experiments using noise-added speech data. Because our NMF-based VC is an exemplar-based rather than a statistical method, we assume that it can avoid the over-fitting problem and produce a natural-sounding voice. The NMF-based VC has also been adapted for assistive technology for people with articulation disorders [18]. In spite of these efforts, VC is still not used in practice.
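The conversion step described above, which estimates weights on the source exemplars and rebuilds the signal from the paired target exemplars, can be sketched as follows. This is a minimal illustration, not the authors' implementation. We assume non-negative magnitude spectra, frame-aligned source and target dictionaries whose columns are parallel exemplars, and KL-divergence multiplicative updates with an optional L1 sparsity weight `lam`. For noise robustness we assume noise exemplars are appended after the speech exemplars in `W_src`, so only the first `W_tgt.shape[1]` weights are used. The names `estimate_activations` and `convert` are hypothetical.

```python
import numpy as np


def estimate_activations(X, W, n_iter=200, lam=0.0, eps=1e-12, seed=0):
    """Estimate non-negative weights H with X ≈ W @ H, keeping W fixed.

    Multiplicative updates minimising KL(X || W @ H) + lam * sum(H).
    """
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], X.shape[1])) + eps
    col_sums = W.sum(axis=0)[:, None]  # W^T 1, shape (n_exemplars, 1)
    for _ in range(n_iter):
        ratio = X / (W @ H + eps)
        H *= (W.T @ ratio) / (col_sums + lam + eps)
    return H


def convert(X_src, W_src, W_tgt, **kwargs):
    """Convert source spectra using paired source/target exemplars.

    Assumption: W_src = [speech exemplars | noise exemplars], and the
    speech columns are parallel to the columns of W_tgt.
    """
    H = estimate_activations(X_src, W_src, **kwargs)
    H_speech = H[: W_tgt.shape[1]]  # drop weights of noise exemplars
    return W_tgt @ H_speech
```

Because the target spectrum reuses the source-side weights, the quality of the result depends on how well the dictionary columns are aligned; this is why the original method needs parallel exemplars.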

NMF-based voice conversion
Exemplar-based voice conversion using a small-parallel corpus
Experiments
Conclusions
