Voice conversion based on Non-negative matrix factorization using phoneme-categorized dictionary

Abstract

We present an exemplar-based voice conversion (VC) method that uses a phoneme-categorized dictionary. Sparse representation-based VC using non-negative matrix factorization (NMF) is employed for spectral conversion between different speakers. In our previous NMF-based VC method, source exemplars and target exemplars are extracted from parallel training data, in which the same texts are uttered by the source and target speakers. The input source signal is represented using the source exemplars and their weights. The converted speech is then constructed from the target exemplars and the weights estimated for the source exemplars. However, this exemplar-based approach needs to hold all the training exemplars (frames), and it may cause phoneme mismatches between input signals and selected exemplars. To reduce these mismatches, we propose a phoneme-categorized sub-dictionary and a dictionary selection method using NMF. Using the sub-dictionary improves VC performance compared with conventional NMF-based VC. The effectiveness of the method was confirmed by comparison with a conventional Gaussian Mixture Model (GMM)-based method and a conventional NMF-based method.
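
The conversion step described in the abstract can be illustrated with a minimal NumPy sketch. It assumes that spectral frames are stored as columns, that `A_src` and `A_tgt` are parallel dictionaries whose columns are time-aligned exemplars, and that the weights are estimated with a KL-divergence multiplicative update plus an L1 sparsity penalty. The update rule, the function names, and the `sparsity` value are illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def estimate_activations(X, A, n_iter=200, sparsity=0.1, eps=1e-12):
    """Estimate non-negative weights H such that X ~= A @ H, with A held fixed.

    Uses the multiplicative update for the KL divergence with an L1 penalty
    on H (an assumed choice; the abstract does not state the cost function).
    """
    rng = np.random.default_rng(0)
    H = rng.random((A.shape[1], X.shape[1])) + eps
    col_sums = A.sum(axis=0)[:, None]  # A^T 1, one value per exemplar
    for _ in range(n_iter):
        ratio = X / (A @ H + eps)
        H *= (A.T @ ratio) / (col_sums + sparsity + eps)
    return H


def convert(X_src, A_src, A_tgt, **kwargs):
    """Represent the input with source exemplars, then rebuild it with target exemplars."""
    H = estimate_activations(X_src, A_src, **kwargs)
    return A_tgt @ H, H


# Toy data standing in for real spectra (purely synthetic).
rng = np.random.default_rng(1)
A_src = rng.random((20, 30)) + 0.01   # 20 frequency bins, 30 exemplars
A_tgt = rng.random((20, 30)) + 0.01   # parallel target exemplars
X_src = A_src @ (rng.random((30, 8)) * (rng.random((30, 8)) < 0.2)) + 1e-6
Y_converted, H = convert(X_src, A_src, A_tgt)
```

Because the two dictionaries are parallel, the weights estimated against `A_src` can be reused unchanged with `A_tgt`; this is also why every training frame has to be kept, which is the storage cost the abstract points out.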

Similar Papers
  • Conference Article
  • Cited by 2
  • 10.21437/interspeech.2014-295
Multimodal exemplar-based voice conversion using lip features in noisy environments
  • Sep 14, 2014
  • Kenta Masaka + 3 more

This paper presents a multimodal voice conversion (VC) method for noisy environments. In our previous exemplar-based VC method, source exemplars and target exemplars are extracted from parallel training data, in which the same texts are uttered by the source and target speakers. The input source signal is then decomposed into source exemplars, noise exemplars obtained from the input signal, and their weights. Then, the converted speech is constructed from the target exemplars and the weights related to the source exemplars. In this paper, we propose a multimodal VC method that improves the noise robustness of our previous exemplar-based VC method. As visual features, we use not only conventional DCT but also the features extracted from Active Appearance Model (AAM) applied to the lip area of a face image. Furthermore, we introduce the combination weight between audio and visual features and formulate a new cost function in order to estimate the audio-visual exemplars. By using the joint audio-visual features as source features, the VC performance is improved compared to a previous audio-input exemplar-based VC method. The effectiveness of this method was confirmed by comparing its effectiveness with that of a conventional Gaussian Mixture Model (GMM)-based method. Index Terms: voice conversion, multimodal, image features, non-negative matrix factorization, noise robustness

  • Conference Article
  • Cited by 11
  • 10.1109/icassp.2014.6853856
Multimodal voice conversion using non-negative matrix factorization in noisy environments
  • May 1, 2014
  • Kenta Masaka + 3 more

This paper presents a multimodal voice conversion (VC) method for noisy environments. In our previous NMF-based VC method, source exemplars and target exemplars are extracted from parallel training data, in which the same texts are uttered by the source and target speakers. The input source signal is then decomposed into source exemplars, noise exemplars obtained from the input signal, and their weights. Then, the converted speech is constructed from the target exemplars and the weights related to the source exemplars. In this paper, we propose a multimodal VC that improves the noise robustness in our NMF-based VC method. By using the joint audio-visual features as source features, the performance of VC is improved compared to a previous audio-input NMF-based VC method. The effectiveness of this method was confirmed by comparing its effectiveness with that of a conventional Gaussian Mixture Model (GMM)-based method.
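
The noisy-input decomposition described above can be sketched as follows. This is a minimal illustration under stated assumptions: the noise exemplars `N` are taken as given (for example, frames of the input assumed to contain no speech), the weights are estimated with a plain KL-divergence multiplicative update, and the function name `convert_noisy` is hypothetical. The multimodal (audio-visual) part of the proposed method is not covered.

```python
import numpy as np


def convert_noisy(X, A_src, A_tgt, N, n_iter=200, eps=1e-12):
    """Decompose noisy input X over [A_src, N] and convert with the speech weights only.

    X:     (bins, frames) noisy source spectra
    A_src: (bins, K) source exemplars; A_tgt: (bins_out, K) parallel target exemplars
    N:     (bins, M) noise exemplars (assumed to be extracted from the input beforehand)
    """
    D = np.hstack([A_src, N])                 # joint speech + noise dictionary
    H = np.ones((D.shape[1], X.shape[1]))     # non-negative initial weights
    col_sums = D.sum(axis=0)[:, None] + eps
    for _ in range(n_iter):
        H *= (D.T @ (X / (D @ H + eps))) / col_sums
    K = A_src.shape[1]
    H_src, H_noise = H[:K], H[K:]
    # Only the weights of the source exemplars are carried over to the target
    # dictionary, so the energy explained by the noise exemplars is discarded.
    return A_tgt @ H_src, H_src, H_noise
```

Keeping the noise exemplars in the same factorization lets them absorb the noise energy, so the speech weights are less likely to be distorted by it.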

  • Research Article
  • Cited by 2
  • 10.1186/s13636-015-0067-4
Multimodal voice conversion based on non-negative matrix factorization
  • Sep 4, 2015
  • EURASIP Journal on Audio, Speech, and Music Processing
  • Kenta Masaka + 3 more

A multimodal voice conversion (VC) method for noisy environments is proposed. In our previous non-negative matrix factorization (NMF)-based VC method, source and target exemplars are extracted from parallel training data, in which the same texts are uttered by the source and target speakers. The input source signal is then decomposed into source exemplars, noise exemplars, and their weights. Then, the converted speech is constructed from the target exemplars and the weights related to the source exemplars. In this study, we propose multimodal VC that improves the noise robustness of our NMF-based VC method. Furthermore, we introduce the combination weight between audio and visual features and formulate a new cost function to estimate audio-visual exemplars. Using the joint audio-visual features as source features, VC performance is improved compared with that of a previous audio-input exemplar-based VC method. The effectiveness of the proposed method is confirmed by comparing its effectiveness with that of a conventional audio-input NMF-based method and a Gaussian mixture model-based method.

  • Conference Article
  • 10.21437/interspeech.2015-579
Many-to-many voice conversion based on multiple non-negative matrix factorization
  • Sep 6, 2015
  • Ryo Aihara + 2 more

We present in this paper an exemplar-based Voice Conversion (VC) method using Non-negative Matrix Factorization (NMF), which is different from conventional statistical VC. NMF-based VC has advantages of noise robustness and naturalness of converted voice compared to Gaussian Mixture Model (GMM)-based VC. However, because NMF-based VC is based on parallel training data of source and target speakers, we cannot convert the voice of arbitrary speakers in this framework. In this paper, we propose a many-to-many VC method that makes use of Multiple Non-negative Matrix Factorization (Multi-NMF). By using Multi-NMF, an arbitrary speaker’s voice is converted to another arbitrary speaker’s voice without the need for any input or output speaker training data. We assume that this method is flexible because we can adapt it to voice quality control or noise-robust VC. Index Terms: voice conversion, speech synthesis, many-to-many, exemplar-based, NMF

  • Research Article
  • Cited by 5
  • 10.1145/2738048
Individuality-Preserving Voice Conversion for Articulation Disorders Using Phoneme-Categorized Exemplars
  • May 11, 2015
  • ACM Transactions on Accessible Computing
  • Ryo Aihara + 2 more

We present a voice conversion (VC) method for a person with an articulation disorder resulting from athetoid cerebral palsy. The movements of such speakers are limited by their athetoid symptoms and their consonants are often unstable or unclear, which makes it difficult for them to communicate. Exemplar-based spectral conversion using Nonnegative Matrix Factorization (NMF) is applied to a voice from a speaker with an articulation disorder. In our conventional work, we used a combined dictionary that was constructed from the source speaker’s vowels and the consonants from a target speaker without articulation disorders in order to preserve the speaker’s individuality. However, this conventional exemplar-based approach needs to use all the training exemplars (frames), and it may cause mismatching of phonemes between input signals and selected exemplars. In order to reduce the mismatching of phoneme alignment, we propose a phoneme-categorized subdictionary and a dictionary selection method using NMF. The effectiveness of this method was confirmed by comparing its effectiveness with that of a conventional Gaussian Mixture Model (GMM)-based and a conventional exemplar-based method.
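
One way to picture a phoneme-categorized sub-dictionary with NMF-based selection is sketched below. The abstract does not specify the selection rule, so the rule used here is an assumption made for illustration: weights are first estimated over the full source dictionary, each frame is assigned to the phoneme category with the largest total weight, and the frame is then re-estimated using only that category's exemplars. The names `activations`, `convert_with_subdictionaries`, and `labels` are hypothetical.

```python
import numpy as np


def activations(X, A, n_iter=100, eps=1e-12):
    """KL-divergence multiplicative updates for H in X ~= A @ H, with A fixed."""
    H = np.ones((A.shape[1], X.shape[1]))
    col_sums = A.sum(axis=0)[:, None] + eps
    for _ in range(n_iter):
        H *= (A.T @ (X / (A @ H + eps))) / col_sums
    return H


def convert_with_subdictionaries(X, A_src, A_tgt, labels, n_iter=100):
    """Select one phoneme category per frame, then convert with that sub-dictionary.

    labels[k] is the phoneme category of exemplar k (column k of A_src and A_tgt).
    """
    labels = np.asarray(labels)
    categories = np.unique(labels)
    H_full = activations(X, A_src, n_iter)
    # Total weight assigned to each category, per frame (assumed selection score).
    scores = np.stack([H_full[labels == c].sum(axis=0) for c in categories])
    chosen = categories[scores.argmax(axis=0)]
    Y = np.zeros((A_tgt.shape[0], X.shape[1]))
    for c in categories:
        frames = np.flatnonzero(chosen == c)
        if frames.size == 0:
            continue
        cols = labels == c
        # Re-estimate the weights using only exemplars of the selected category.
        H_sub = activations(X[:, frames], A_src[:, cols], n_iter)
        Y[:, frames] = A_tgt[:, cols] @ H_sub
    return Y, chosen
```

Restricting each frame to a single category means an input frame cannot be explained by exemplars of a different phoneme, which is the mismatch the abstract aims to reduce. The second estimation pass also runs on a much smaller dictionary.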

  • Conference Article
  • 10.1109/icnisc.2018.00052
Voice Conversion Based on Unified Dictionary with Clustered Features between Non-parallel Corpus
  • Apr 1, 2018
  • Jin Hui + 1 more

Non-negative matrix factorization (NMF) has recently been widely applied to exemplar-based voice conversion (VC). It offers advantages in noise robustness and naturalness of the converted voice compared with conventional statistical Gaussian mixture model-based VC. However, parallel training data from source and target speakers are required, so it cannot convert the voices of arbitrary speakers, especially when the corpus of target speakers is inadequate. In this paper, we present a novel algorithm that clusters the spectral features in high dimensions to construct a unified dictionary and introduces a mapping matrix between source and target sparse coefficients. Experimental results demonstrate that the average cepstral distortion is 0.833, which is about 4.3% lower than that of the conventional NMF-based method. Subjective evaluations such as ABX and MOS are also discussed. They indicate that the speech quality in our study is better than that of conventional NMF. The target speaker's spectra do not even need to be included in the training set.

  • Conference Article
  • Cited by 28
  • 10.1109/icassp.2013.6639230
Individuality-preserving voice conversion for articulation disorders based on non-negative matrix factorization
  • May 1, 2013
  • Ryo Aihara + 3 more

We present in this paper a voice conversion (VC) method for a person with an articulation disorder resulting from athetoid cerebral palsy. The movement of such speakers is limited by their athetoid symptoms, and their consonants are often unstable or unclear, which makes it difficult for them to communicate. In this paper, exemplar-based spectral conversion using Non-negative Matrix Factorization (NMF) is applied to a voice with an articulation disorder. To preserve the speaker's individuality, we used a combined dictionary that is constructed from the source speaker's vowels and target speaker's consonants. Experimental results indicate that the performance of NMF-based VC is considerably better than conventional GMM-based VC.

  • Conference Article
  • Cited by 3
  • 10.1109/waspaa.2015.7336943
Many-to-one voice conversion using exemplar-based sparse representation
  • Oct 1, 2015
  • Ryo Aihara + 2 more

Voice conversion (VC) is being widely researched in the field of speech processing because of increased interest in using such processing in applications such as personalized Text-to-Speech systems. We present in this paper a many-to-one VC method using exemplar-based sparse representation, which is different from conventional statistical VC. In our previous exemplar-based VC method, input speech was represented by the source dictionary and its sparse coefficients. The source and the target dictionaries are fully coupled and the converted voice is constructed from the source coefficients and the target dictionary. This method requires parallel exemplars (which consist of the source exemplars and target exemplars that have the same texts uttered by the source and target speakers) for dictionary construction. In this paper, we propose a many-to-one VC method in an exemplar-based framework which does not need training data of the source speaker. Some statistical approaches for many-to-one VC have been proposed; however, in the framework of exemplar-based VC, such a method has never been proposed. The effectiveness of our many-to-one VC has been confirmed by comparing its effectiveness with that of a conventional one-to-one NMF-based method and one-to-one GMM-based method.

  • Research Article
  • Cited by 44
  • 10.1109/tbme.2016.2644258
Joint Dictionary Learning-Based Non-Negative Matrix Factorization for Voice Conversion to Improve Speech Intelligibility After Oral Surgery.
  • Nov 1, 2017
  • IEEE Transactions on Biomedical Engineering
  • Szu-Wei Fu + 5 more

Objective: This paper focuses on machine learning based voice conversion (VC) techniques for improving the speech intelligibility of surgical patients who have had parts of their articulators removed. Because of the removal of parts of the articulator, a patient's speech may be distorted and difficult to understand. To overcome this problem, VC methods can be applied to convert the distorted speech such that it is clear and more intelligible. To design an effective VC method, two key points must be considered: 1) the amount of training data may be limited (because speaking for a long time is usually difficult for postoperative patients); 2) rapid conversion is desirable (for better communication). Methods: We propose a novel joint dictionary learning based non-negative matrix factorization (JD-NMF) algorithm. Compared to conventional VC techniques, JD-NMF can perform VC efficiently and effectively with only a small amount of training data. Results: The experimental results demonstrate that the proposed JD-NMF method not only achieves notably higher short-time objective intelligibility (STOI) scores (a standardized objective intelligibility evaluation metric) than those obtained using the original unconverted speech but is also significantly more efficient and effective than a conventional exemplar-based NMF VC method. Conclusion: The proposed JD-NMF method may outperform the state-of-the-art exemplar-based NMF VC method in terms of STOI scores under the desired scenario. Significance: We confirmed the advantages of the proposed joint training criterion for the NMF-based VC. Moreover, we verified that the proposed JD-NMF can effectively improve the speech intelligibility scores of oral surgery patients.

  • Research Article
  • Cited by 6
  • 10.1186/s13636-015-0075-4
Small-parallel exemplar-based voice conversion in noisy environments using affine non-negative matrix factorization
  • Nov 25, 2015
  • EURASIP Journal on Audio, Speech, and Music Processing
  • Ryo Aihara + 4 more

The need to have a large amount of parallel data is a large hurdle for the practical use of voice conversion (VC). This paper presents a novel framework of exemplar-based VC that only requires a small number of parallel exemplars. In our previous work, a VC technique using non-negative matrix factorization (NMF) for noisy environments was proposed. This method requires parallel exemplars (which consist of the source exemplars and target exemplars that have the same texts uttered by the source and target speakers) for dictionary construction. In the framework of conventional Gaussian mixture model (GMM)-based VC, some approaches that do not need parallel exemplars have been proposed. However, in the framework of exemplar-based VC for noisy environments, such a method has never been proposed. In this paper, an adaptation matrix in an NMF framework is introduced to adapt the source dictionary to the target dictionary. This adaptation matrix is estimated using only a small parallel speech corpus. We refer to this method as affine NMF, and the effectiveness of this method has been confirmed by comparing its effectiveness with that of a conventional NMF-based method and a GMM-based method in noisy environments.

  • Research Article
  • Cited by 18
  • 10.1186/1687-4722-2014-5
A preliminary demonstration of exemplar-based voice conversion for articulation disorders using an individuality-preserving dictionary
  • Feb 1, 2014
  • EURASIP Journal on Audio, Speech, and Music Processing
  • Ryo Aihara + 3 more

We present in this paper a voice conversion (VC) method for a person with an articulation disorder resulting from athetoid cerebral palsy. The movement of such speakers is limited by their athetoid symptoms, and their consonants are often unstable or unclear, which makes it difficult for them to communicate. In this paper, exemplar-based spectral conversion using nonnegative matrix factorization (NMF) is applied to a voice with an articulation disorder. To preserve the speaker's individuality, we used an individuality-preserving dictionary that is constructed from the source speaker's vowels and target speaker's consonants. Using this dictionary, we can create a natural and clear voice preserving their voice's individuality. Experimental results indicate that the performance of NMF-based VC is considerably better than conventional GMM-based VC.

  • Conference Article
  • 10.21437/interspeech.2013-323
Exemplar-based individuality-preserving voice conversion for articulation disorders in noisy environments
  • Aug 25, 2013
  • Ryo Aihara + 3 more

We present in this paper a noise robust voice conversion (VC) method for a person with an articulation disorder resulting from athetoid cerebral palsy. The movements of such speakers are limited by their athetoid symptoms, and their consonants are often unstable or unclear, which makes it difficult for them to communicate. In this paper, exemplar-based spectral conversion using Non-negative Matrix Factorization (NMF) is applied to a voice with an articulation disorder in real noisy environments. In this paper, in order to deal with background noise, an input noisy source signal is decomposed into the clean source exemplars and noise exemplars by NMF. Also, to preserve the speaker’s individuality, we use a combined dictionary that was constructed from the source speaker’s vowels and target speaker’s consonants. The effectiveness of this method was confirmed by comparing its effectiveness with that of a conventional Gaussian Mixture Model (GMM)-based method. Index Terms: Voice Conversion, NMF, Articulation Disorders, Noise Robustness, Assistive Technologies

  • Conference Article
  • 10.1109/sii.2013.6776630
Voice conversion based on Non-negative Matrix Factorization in noisy environments
  • Dec 1, 2013
  • Takao Fujii + 4 more

This paper presents a voice conversion (VC) technique for noisy environments. We prepared parallel exemplars (dictionary) that consist of the source and target exemplars, which have the same texts uttered by the source and target speakers. The input source signal is decomposed into the source exemplars, noise exemplars obtained from the input signal, and their weights (activities). Then, the converted signal is obtained by calculating the linear combination of the target exemplars and the weights which are calculated using the source exemplars. In the proposed method, a Gaussian Mixture Model (GMM) -based conversion method is also applied to the feature vectors generated by the sparse coding in order to compensate a mismatch between the weights of source and target exemplars. The effectiveness of this method was confirmed by comparing its effectiveness with that of a conventional method.

  • Conference Article
  • Cited by 3
  • 10.1109/apsipaasc47483.2019.9023264
Non-parallel Voice Conversion with Controllable Speaker Individuality using Variational Autoencoder
  • Nov 1, 2019
  • Tuan Vu Ho + 1 more

We propose a flexible non-parallel voice conversion (VC) system that is capable of both performing speaker adaptation and controlling speaker individuality. The proposed VC framework aims to tackle the inability of conventional VC models to arbitrarily modify voice characteristics in the converted waveform. To achieve this goal, we use the speaker embedding realized by a Variational Autoencoder (VAE) instead of one-hot encoded vectors to represent and modify the target voice's characteristics. No parallel training data, linguistic labels, or time alignment procedure are required to train our system. After training on a multi-speaker speech database, the proposed VC system can adapt an arbitrary source speaker to any target speaker using only one sample from the target speaker. The speaker individuality of converted speech can be controlled by modifying the speaker embedding vectors, resulting in a fictitious speaker individuality. The experimental results showed that our proposed system is similar to conventional non-parallel VAE-based VC and better than the parallel Gaussian Mixture Model (GMM) in both perceived speech naturalness and speaker similarity, even when our system uses only one sample from the target speaker. Moreover, our proposed system can convert a source voice to a fictitious target voice with a well-perceived speech naturalness of 3.1 MOS.

  • Conference Article
  • 10.1109/icassp.2018.8462569
Parallel-Data-Free Dictionary Learning for Voice Conversion Using Non-Negative Tucker Decomposition
  • Apr 1, 2018
  • Yuki Takashima + 4 more

Voice conversion (VC) is a technique where only speaker-specific information in source speech is converted while preserving the associated phonological information. Nonnegative Matrix Factorization (NMF)-based VC has been researched because of the natural-sounding voice it produces compared with conventional Gaussian Mixture Model-based VC. In conventional NMF-VC, parallel data are used to train the models; therefore, unnatural pre-processing of speech data to make parallel data is needed. NMF-VC also tends to be a large model because this method has many parallel exemplars for the dictionary matrix; therefore, the computational cost is high. In this paper, we propose a novel parallel dictionary learning method using non-negative Tucker decomposition (NTD) which uses tensor decomposition and decomposes an input observation into a set of mode matrices and one core tensor. Our proposed NTD-based dictionary learning method estimates the dictionary matrix for NMF-VC without using parallel data. Experimental results show that our proposed method outperforms conventional non-parallel VC methods.
