A novel method for voice conversion based on non-parallel corpus

Abstract

This article puts forward a new voice conversion algorithm that removes the need for a parallel corpus in the training phase and also resolves the problem of an insufficient target-speaker corpus. The proposed approach builds on a recent voice conversion model that combines the classical LPC analysis-synthesis framework with a GMM. Through this algorithm, conversion functions among vowels and demi-syllables are derived. We assume that these functions are largely the same across speakers whose gender, accent, and language match. We can therefore produce the demi-syllables with access to only a few sentences from the target speaker, by forming the GMM for one of his or her vowels. Evaluation of the proposed method shows that it can efficiently capture the speech features of the target speaker, and that it provides results comparable to those obtained with parallel-corpus-based approaches.
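The GMM regression at the core of such LPC+GMM approaches maps each source spectral frame to a target frame as a posterior-weighted sum of per-component linear predictions. The sketch below, in plain numpy, assumes a joint GMM has already been trained; the function and parameter names are illustrative, not the paper's.

```python
import numpy as np

def gmm_convert(x, weights, mu_x, mu_y, cov_xx, cov_yx):
    """Convert one source feature vector x with a pre-trained joint GMM.

    weights: (K,) mixture weights
    mu_x, mu_y: (K, D) component means for source/target features
    cov_xx: (K, D, D) source covariances
    cov_yx: (K, D, D) cross-covariances between target and source
    """
    K, D = mu_x.shape
    # Posterior probability of each component given x (Gaussian responsibilities).
    post = np.empty(K)
    for k in range(K):
        diff = x - mu_x[k]
        inv = np.linalg.inv(cov_xx[k])
        norm = 1.0 / np.sqrt((2 * np.pi) ** D * np.linalg.det(cov_xx[k]))
        post[k] = weights[k] * norm * np.exp(-0.5 * diff @ inv @ diff)
    post /= post.sum()
    # MMSE regression: posterior-weighted sum of per-component linear
    # predictions of the target feature.
    y = np.zeros(D)
    for k in range(K):
        diff = x - mu_x[k]
        y += post[k] * (mu_y[k] + cov_yx[k] @ np.linalg.inv(cov_xx[k]) @ diff)
    return y
```

With a single component and identity covariances this reduces to a simple mean shift, which is a useful sanity check when wiring up a real conversion system.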

Similar Papers
  • Research Article
  • Cited by 2
  • 10.1186/s13636-015-0067-4
Multimodal voice conversion based on non-negative matrix factorization
  • Sep 4, 2015
  • EURASIP Journal on Audio, Speech, and Music Processing
  • Kenta Masaka + 3 more

A multimodal voice conversion (VC) method for noisy environments is proposed. In our previous non-negative matrix factorization (NMF)-based VC method, source and target exemplars are extracted from parallel training data, in which the same texts are uttered by the source and target speakers. The input source signal is decomposed into source exemplars, noise exemplars, and their weights, and the converted speech is then constructed from the target exemplars and the weights related to the source exemplars. In this study, we propose a multimodal VC method that improves the noise robustness of our NMF-based VC method. Furthermore, we introduce a combination weight between audio and visual features and formulate a new cost function to estimate audio-visual exemplars. Using the joint audio-visual features as source features, VC performance is improved compared with that of the previous audio-input exemplar-based VC method. The effectiveness of the proposed method is confirmed by comparison with a conventional audio-input NMF-based method and a Gaussian mixture model-based method.
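The exemplar-based NMF decomposition described above can be sketched as follows: the source spectrogram is factorized over a fixed source exemplar dictionary, and the resulting activation weights are reapplied to the paired target exemplars. This is a minimal illustration with synthetic matrices; the names and the simple Euclidean multiplicative update are assumptions, not the authors' exact formulation (noise exemplars and audio-visual weighting are omitted).

```python
import numpy as np

def exemplar_vc(X, A_src, A_tgt, n_iter=500, eps=1e-9):
    """Exemplar-based conversion: decompose the source spectrogram X (F x T)
    over a fixed source exemplar dictionary A_src (F x N), then rebuild the
    frames with the paired target exemplars A_tgt (F x N).
    """
    rng = np.random.default_rng(0)
    H = rng.random((A_src.shape[1], X.shape[1]))  # non-negative activations
    for _ in range(n_iter):
        # Multiplicative update minimizing ||X - A_src H||^2 while keeping
        # the activations non-negative (dictionary stays fixed).
        H *= (A_src.T @ X) / (A_src.T @ A_src @ H + eps)
    return A_tgt @ H, H
```

Because the same activations H are shared between the two dictionaries, a frame is converted by "spelling" it with source exemplars and re-synthesizing with the aligned target exemplars.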

  • Conference Article
  • Cited by 37
  • 10.1109/icassp.2004.1325907
Non-parallel training for voice conversion by maximum likelihood constrained adaptation
  • Nov 19, 2004
  • A Mouchtaris + 2 more

The objective of voice conversion methods is to modify the speech characteristics of a particular speaker in such a manner as to sound like speech by a different target speaker. Current voice conversion algorithms derive a conversion function by estimating its parameters from a corpus that contains the same utterances spoken by both speakers. Such a corpus, usually referred to as a parallel corpus, has the disadvantage that it is often difficult or even impossible to collect. Here, we propose a voice conversion method that does not require a parallel corpus for training, i.e., the utterances spoken by the two speakers need not be the same: speaker adaptation techniques are employed to adapt, to a particular pair of source and target speakers, the conversion parameters derived from a different pair of speakers. We show that adaptation reduces the error obtained when simply applying the conversion parameters of one pair of speakers to another by a factor that can reach 30% in many cases, with performance comparable to the ideal case in which a parallel corpus is available.

  • Research Article
  • Cited by 2
  • 10.3390/app14104251
Wav2wav: Wave-to-Wave Voice Conversion
  • May 17, 2024
  • Applied Sciences
  • Changhyeon Jeong + 3 more

Voice conversion is the task of changing the speaker characteristics of input speech while preserving its linguistic content. It can be used in various areas, such as entertainment, medicine, and education, and the quality of the converted speech is crucial for voice conversion algorithms to be useful in these applications. Deep learning-based voice conversion algorithms, which have recently shown promising results, generally consist of three modules: a feature extractor, a feature converter, and a vocoder. The feature extractor accepts the waveform as input and extracts speech feature vectors for further processing; these feature vectors are later synthesized back into waveforms by the vocoder. The feature converter performs the actual voice conversion, so many previous studies focused on improving this module in isolation and combined it with a separately trained vocoder to synthesize the final waveform. Because the feature converter and the vocoder are trained independently, the output of the converter may not be compatible with the input of the vocoder, which causes performance degradation. Furthermore, most voice conversion algorithms use mel-spectrogram-based speech feature vectors without modification; these feature vectors have performed well in a variety of speech-processing areas but could be further optimized for voice conversion tasks. To address these problems, we propose a novel wave-to-wave (wav2wav) voice conversion method that integrates the feature extractor, the feature converter, and the vocoder into a single module and trains the system in an end-to-end manner. We evaluated the efficiency of the proposed method using the VCC2018 dataset.
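The conventional three-module pipeline that this abstract contrasts against can be sketched with toy stand-ins; the point is the hand-off between independently designed stages, where a mismatch can arise because the converter is never trained against the vocoder's actual input distribution. All function names and the trivial linear "conversion" below are illustrative, not the paper's modules.

```python
import numpy as np

def feature_extractor(wave, frame=64):
    # Frame the waveform and take magnitude-FFT features per frame
    # (real systems typically use mel-spectrograms).
    n = len(wave) // frame
    frames = wave[: n * frame].reshape(n, frame)
    return np.abs(np.fft.rfft(frames, axis=1))

def feature_converter(feats, W):
    # Placeholder conversion: a single linear map per frame
    # (real systems use a trained neural converter).
    return feats @ W

def vocoder(feats, frame=64):
    # Placeholder synthesis: inverse FFT with zero phase
    # (real systems use a trained neural vocoder).
    return np.fft.irfft(feats, n=frame, axis=1).reshape(-1)

wave = np.random.default_rng(0).standard_normal(640)
feats = feature_extractor(wave)
W = np.eye(feats.shape[1])  # identity "conversion" for the demo
out = vocoder(feature_converter(feats, W))
```

Even in this toy chain, the vocoder only ever sees whatever the converter emits; the wav2wav idea is to train the three stages jointly so that no stage receives inputs it was never optimized for.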

  • Conference Article
  • 10.21437/interspeech.2015-579
Many-to-many voice conversion based on multiple non-negative matrix factorization
  • Sep 6, 2015
  • Ryo Aihara + 2 more

We present in this paper an exemplar-based Voice Conversion (VC) method using Non-negative Matrix Factorization (NMF), which differs from conventional statistical VC. NMF-based VC has advantages in noise robustness and naturalness of the converted voice compared to Gaussian Mixture Model (GMM)-based VC. However, because NMF-based VC relies on parallel training data from source and target speakers, the voice of arbitrary speakers cannot be converted in this framework. In this paper, we propose a many-to-many VC method that makes use of Multiple Non-negative Matrix Factorization (Multi-NMF). By using Multi-NMF, an arbitrary speaker's voice is converted to another arbitrary speaker's voice without the need for any input or output speaker training data. We believe this method is flexible because it can be adapted to voice quality control or noise-robust VC. Index Terms: voice conversion, speech synthesis, many-to-many, exemplar-based, NMF

  • Conference Article
  • 10.1109/ispacs.2005.1595339
Statistical eigenvoice: speaker features within S+N framework and a way towards language-independent voice conversion
  • Jan 1, 2005
  • Feng Huang + 1 more

This paper presents a statistical method for speaker feature extraction and voice conversion within a sinusoidal + noise (S+N) modeling framework. Through fundamental research on the speaker characteristics embedded in the parameter sets of the S+N model, we found that the vector sets of statistical eigenvoice (SEV) and weighted statistical eigenvoice (wSEV), which are basis vectors of the GMM representation, have significant properties: they are approximately speaker-dependent and language-independent. Building on the feature vectors of SEV and wSEV, we present a new algorithm for context-free voice conversion. Subjective tests suggest that the SEV-based method achieves convincing results while maintaining high synthesis quality in comparison to traditional LPC approaches.

  • Research Article
  • Cited by 1
  • 10.1016/j.engappai.2024.109071
Target speaker filtration by mask estimation for source speaker traceability in voice conversion
  • Aug 3, 2024
  • Engineering Applications of Artificial Intelligence
  • Junfei Zhang + 5 more


  • Conference Article
  • 10.1109/eusipco.2016.7760320
3WRBM-based speech factor modeling for arbitrary-source and non-parallel voice conversion
  • Aug 1, 2016
  • Toru Nakashika + 1 more

In recent years, voice conversion (VC) has become a popular technique, since it can be applied to various speech tasks. Most existing VC approaches must use aligned speech pairs (parallel data) from the source and target speakers in training, which makes them hard to apply. Furthermore, VC methods proposed so far require the source speaker to be specified at conversion time, even though in many VC scenarios we simply want to obtain the target speaker's voice from any other speaker. In this paper, we propose a VC method that requires neither parallel data in training nor a specified source speaker in conversion. Our approach models the joint probability of acoustic, phonetic, and speaker features using a three-way restricted Boltzmann machine (3WRBM). Speaker-independent (SI) and speaker-dependent (SD) parameters of the model are estimated simultaneously under the maximum-likelihood (ML) criterion using a speech set of multiple speakers. In the conversion stage, phonetic features are first estimated in a probabilistic manner from the speech of an arbitrary speaker, and a voice-converted speech signal is then produced using the SD parameters of the target speaker. Our experimental results show not only that our approach outperforms other non-parallel VC methods, but also that the performance of the arbitrary-source VC is close to that of traditional source-specified VC within our approach.

  • Conference Article
  • Cited by 2
  • 10.21437/interspeech.2014-295
Multimodal exemplar-based voice conversion using lip features in noisy environments
  • Sep 14, 2014
  • Kenta Masaka + 3 more

This paper presents a multimodal voice conversion (VC) method for noisy environments. In our previous exemplar-based VC method, source exemplars and target exemplars are extracted from parallel training data, in which the same texts are uttered by the source and target speakers. The input source signal is then decomposed into source exemplars, noise exemplars obtained from the input signal, and their weights. The converted speech is then constructed from the target exemplars and the weights related to the source exemplars. In this paper, we propose a multimodal VC method that improves the noise robustness of our previous exemplar-based VC method. As visual features, we use not only conventional DCT features but also features extracted from an Active Appearance Model (AAM) applied to the lip area of a face image. Furthermore, we introduce a combination weight between audio and visual features and formulate a new cost function in order to estimate the audio-visual exemplars. By using the joint audio-visual features as source features, VC performance is improved compared to the previous audio-input exemplar-based VC method. The effectiveness of this method was confirmed by comparison with a conventional Gaussian Mixture Model (GMM)-based method. Index Terms: voice conversion, multimodal, image features, non-negative matrix factorization, noise robustness

  • Research Article
  • Cited by 18
  • 10.1016/j.specom.2014.12.004
Voice conversion based on feature combination with limited training data
  • Dec 13, 2014
  • Speech Communication
  • Mostafa Ghorbandoost + 5 more


  • Conference Article
  • Cited by 1
  • 10.1109/icassp49357.2023.10096901
JSV-VC: Jointly Trained Speaker Verification and Voice Conversion Models
  • Jun 4, 2023
  • Shogo Seki + 3 more

This paper proposes a variational autoencoder (VAE)-based method for voice conversion (VC) on arbitrary source-target speaker pairs without parallel corpora, i.e., non-parallel any-to-any VC. One typical approach is to use speaker embeddings obtained from a speaker verification (SV) model as the condition for a VC model. However, converted speech is not guaranteed to reflect a target speaker’s characteristics in a naive combination of VC and SV models. Moreover, speaker embeddings are not designed for VC problems, leading to suboptimal conversion performance. To address these issues, the proposed method, JSV-VC, trains both VC and SV models jointly. The VC model is trained so that converted speech is verified as the target speaker in the SV model, while the SV model is trained in order to output consistent embeddings before and after the VC model. The experimental evaluation reveals that JSV-VC outperforms conventional any-to-any VC methods quantitatively and qualitatively.

  • Research Article
  • Cited by 107
  • 10.1109/tsa.2005.857790
Nonparallel training for voice conversion based on a parameter adaptation approach
  • May 1, 2006
  • IEEE Transactions on Audio, Speech and Language Processing
  • A Mouchtaris + 2 more

The objective of voice conversion algorithms is to modify the speech of a particular source speaker so that it sounds as if spoken by a different target speaker. Current conversion algorithms employ a training procedure during which the same utterances, spoken by both the source and target speakers, are needed to derive the desired conversion parameters. Such a (parallel) corpus is often difficult or impossible to collect. Here, we propose an algorithm that relaxes this constraint, i.e., the training corpus need not contain the same utterances from both speakers. The proposed algorithm is based on speaker adaptation techniques, adapting the conversion parameters derived for a particular pair of speakers to a different pair for which only a nonparallel corpus is available. We show that adaptation reduces the error obtained when simply applying the conversion parameters of one pair of speakers to another by a factor that can reach 30%. A speaker identification measure is also employed that more insightfully portrays the importance of adaptation, while listening tests confirm the success of our method. Both the objective and subjective tests demonstrate that the proposed algorithm achieves results comparable to the ideal case in which a parallel corpus is available.

  • Conference Article
  • Cited by 14
  • 10.1109/icassp.2007.366960
Conditional Vector Quantization for Voice Conversion
  • Apr 1, 2007
  • A Mouchtaris + 2 more

Voice conversion methods have the objective of transforming speech spoken by a particular source speaker so that it sounds as if spoken by a different target speaker. The majority of voice conversion methods are based on transforming the short-time spectral envelope of the source speaker, using correspondences between the source and target vectors derived from training speech of both speakers. These correspondences are usually obtained by segmenting the spectral vectors of one or both speakers into clusters, using soft (GMM-based) or hard (VQ-based) clustering. Here, we propose that voice conversion performance can be improved by taking advantage of the fact that the relationship between the source and target vectors is often one-to-many. To illustrate this, we propose that a VQ approach, namely conditional vector quantization (CVQ), can be used for voice conversion. Results indicate that such a relationship between the source and target data does exist and can be exploited by a CVQ-based conversion function.
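A hard-VQ conversion baseline of the kind this abstract builds on can be sketched as follows: cluster the source vectors, then map each cluster to the mean of the time-aligned target vectors. CVQ generalizes this by letting one source cluster select among several target codewords; the k-means-style sketch below, with illustrative names and a deliberately simple init, is not the authors' CVQ formulation.

```python
import numpy as np

def train_vq_mapping(src, tgt, K, n_iter=20):
    """Hard-VQ baseline: k-means on source vectors, one target codeword
    per source cluster (the mean of the aligned target vectors)."""
    # Simple deterministic init: spread initial centers across the data order.
    centers = src[np.linspace(0, len(src) - 1, K).astype(int)].copy()
    for _ in range(n_iter):
        # Assign each source vector to its nearest center (Lloyd's step).
        labels = np.argmin(((src[:, None] - centers) ** 2).sum(-1), axis=1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = src[labels == k].mean(axis=0)
    # Target codebook: mean of the aligned target vectors per cluster.
    codebook = np.array([tgt[labels == k].mean(axis=0) if np.any(labels == k)
                         else tgt.mean(axis=0) for k in range(K)])
    return centers, codebook

def convert(x, centers, codebook):
    # Quantize the source vector, emit its cluster's target codeword.
    return codebook[np.argmin(((x - centers) ** 2).sum(-1))]
```

The one-to-many limitation is visible here: every source vector in a cluster maps to the same single target codeword, which is exactly what a conditional choice among multiple target codewords is meant to relax.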

  • Research Article
  • Cited by 17
  • 10.1109/tasl.2009.2035029
Synthesis of Child Speech With HMM Adaptation and Voice Conversion
  • Jul 1, 2010
  • IEEE Transactions on Audio, Speech, and Language Processing
  • Oliver Watts + 3 more

The synthesis of child speech presents challenges both in the collection of data and in the building of a synthesizer from that data. We chose to build a statistical parametric synthesizer using the hidden Markov model (HMM)-based system HTS, as this technique has previously been shown to perform well for limited amounts of data, and for data collected under imperfect conditions. Six different configurations of the synthesizer were compared, using both speaker-dependent and speaker-adaptive modeling techniques, and using varying amounts of data. For comparison with HMM adaptation, techniques from voice conversion were used to transform existing synthesizers to the characteristics of the target speaker. Speaker-adaptive voices generally outperformed child speaker-dependent voices in the evaluation. HMM adaptation outperformed voice conversion style techniques when using the full target speaker corpus; with fewer adaptation data, however, no significant listener preference for either HMM adaptation or voice conversion methods was found.

  • Research Article
  • Cited by 8
  • 10.1109/taslp.2016.2522643
Multiple Non-Negative Matrix Factorization for Many-to-Many Voice Conversion
  • Jul 1, 2016
  • IEEE/ACM Transactions on Audio, Speech, and Language Processing
  • Ryo Aihara + 2 more

A novel voice conversion (VC) method for arbitrary speakers is proposed. Non-negative matrix factorization (NMF) has recently been applied to exemplar-based VC. It offers noise robustness and naturalness of the converted voice compared with widely used Gaussian mixture model-based VC. However, because NMF-based VC requires parallel training data from source and target speakers, the voice of arbitrary speakers cannot be converted in this framework. In this study, we propose multiple non-negative matrix factorization (Multi-NMF) to enable many-to-many, exemplar-based VC. Our experimental results demonstrate that the conversion quality of the proposed method is close to that of conventional one-to-one VC, even though the proposed method requires neither the source speakers' spectra nor the target speakers' spectra to be included in the training set.

  • Research Article
  • Cited by 21
  • 10.1007/s11042-015-3039-x
High quality voice conversion using prosodic and high-resolution spectral features
  • Nov 19, 2015
  • Multimedia Tools and Applications
  • Hy Quy Nguyen + 4 more

Voice conversion methods have advanced rapidly over the last decade. Studies have shown that speaker characteristics are captured by spectral features as well as various prosodic features. Most existing conversion methods focus on the spectral feature as it directly represents the timbre characteristics, while some conversion methods have focused only on the prosodic feature represented by the fundamental frequency. In this paper, a comprehensive framework using deep neural networks to convert both timbre and prosodic features is proposed. The timbre feature is represented by a high-resolution spectral feature; the prosodic features include F0, intensity, and duration. It is well known that DNNs are useful for modeling high-dimensional features. In this work, we show that a DNN initialized by our proposed autoencoder pretraining yields good-quality DNN conversion models. This pretraining is tailor-made for voice conversion and leverages an autoencoder to capture the generic spectral shape of the source speech. Additionally, our framework uses segmental DNN models to capture the evolution of the prosodic features over time. To reconstruct the converted speech, the spectral feature produced by the DNN model is combined with the three prosodic features produced by the segmental DNN models. Our experimental results show that the application of both prosodic and high-resolution spectral features leads to high-quality converted speech as measured by objective evaluation and subjective listening tests.
