Structured Sparse Spectral Transforms and Structural Measures for Voice Conversion.

  • Abstract
  • Literature Map
  • Similar Papers
Abstract
Translate article icon Translate Article Star icon
Take notes icon Take Notes

We investigate a structured sparse spectral transform method for voice conversion (VC) to perform frequency warping and spectral shaping simultaneously on high-dimensional (D) STRAIGHT spectra. Learning a large transform matrix for high-D data often results in an overfit matrix with low sparsity, which leads to muffled speech in VC. We address this problem by using the frequency-warping characteristic of a source-target speaker pair to define a region of support (ROS) in a transform matrix, and further optimize it by nonnegative matrix factorization (NMF) to obtain structured sparse transform. We also investigate structural measures of spectral and temporal covariance and variance at different scales for assessing VC speech quality. Our experiments on ARCTIC dataset of 12 speaker pairs show that embedding the ROS in spectral transforms offers flexibility in tradeoffs between spectral distortion and structure preservation, and the structural measures provide quantitatively reasonable results on converted speech. Our subjective listening tests show that the proposed VC method achieves a mean opinion score of "very good" relative to natural speech, and in comparison with three other VC methods, it is the most preferred one in naturalness and in voice similarity to target speakers.

Similar Papers
  • PDF Download Icon
  • Research Article
  • Cite Count Icon 2
  • 10.1186/s13636-015-0067-4
Multimodal voice conversion based on non-negative matrix factorization
  • Sep 4, 2015
  • EURASIP Journal on Audio, Speech, and Music Processing
  • Kenta Masaka + 3 more

A multimodal voice conversion (VC) method for noisy environments is proposed. In our previous non-negative matrix factorization (NMF)-based VC method, source and target exemplars are extracted from parallel training data, in which the same texts are uttered by the source and target speakers. The input source signal is then decomposed into source exemplars, noise exemplars, and their weights. Then, the converted speech is constructed from the target exemplars and the weights related to the source exemplars. In this study, we propose multimodal VC that improves the noise robustness of our NMF-based VC method. Furthermore, we introduce the combination weight between audio and visual features and formulate a new cost function to estimate audio-visual exemplars. Using the joint audio-visual features as source features, VC performance is improved compared with that of a previous audio-input exemplar-based VC method. The effectiveness of the proposed method is confirmed by comparing its effectiveness with that of a conventional audio-input NMF-based method and a Gaussian mixture model-based method.

  • Conference Article
  • 10.21437/interspeech.2015-579
Many-to-many voice conversion based on multiple non-negative matrix factorization
  • Sep 6, 2015
  • Ryo Aihara + 2 more

We present in this paper an exemplar-based Voice Conversion (VC) method using Non-negative Matrix Factorization (NMF), which is different from conventional statistical VC. NMF-based VC has advantages of noise robustness and naturalness of converted voice compared to Gaussian Mixture Model (GMM)based VC. However, because NMF-based VC is based on parallel training data of source and target speakers, we cannot convert the voice of arbitrary speakers in this framework. In this paper, we propose a many-to-many VC method that makes use of Multiple Non-negative Matrix Factorization (Multi-NMF). By using Multi-NMF, an arbitrary speaker’s voice is converted to another arbitrary speaker’s voice without the need for any input or output speaker training data. We assume that this method is flexible because we can adopt it to voice quality control or noise robust VC. Index Terms: voice conversion, speech synthesis, many-tomany, exemplar-based, NMF

  • Research Article
  • Cite Count Icon 44
  • 10.1109/tbme.2016.2644258
Joint Dictionary Learning-Based Non-Negative Matrix Factorization for Voice Conversion to Improve Speech Intelligibility After Oral Surgery.
  • Nov 1, 2017
  • IEEE Transactions on Biomedical Engineering
  • Szu-Wei Fu + 5 more

Objective: This paper focuses on machine learning based voice conversion (VC) techniques for improving the speech intelligibility of surgical patients who have had parts of their articulators removed. Because of the removal of parts of the articulator, a patient's speech may be distorted and difficult to understand. To overcome this problem, VC methods can be applied to convert the distorted speech such that it is clear and more intelligible. To design an effective VC method, two key points must be considered: 1) the amount of training data may be limited (because speaking for a long time is usually difficult for postoperative patients); 2) rapid conversion is desirable (for better communication). Methods: We propose a novel joint dictionary learning based non-negative matrix factorization (JD-NMF) algorithm. Compared to conventional VC techniques, JD-NMF can perform VC efficiently and effectively with only a small amount of training data. Results: The experimental results demonstrate that the proposed JD-NMF method not only achieves notably higher short-time objective intelligibility (STOI) scores (a standardized objective intelligibility evaluation metric) than those obtained using the original unconverted speech but is also significantly more efficient and effective than a conventional exemplar-based NMF VC method. Conclusion: The proposed JD-NMF method may outperform the state-of-the-art exemplar-based NMF VC method in terms of STOI scores under the desired scenario. Significance: We confirmed the advantages of the proposed joint training criterion for the NMF-based VC. Moreover, we verified that the proposed JD-NMF can effectively improve the speech intelligibility scores of oral surgery patients.Objective: This paper focuses on machine learning based voice conversion (VC) techniques for improving the speech intelligibility of surgical patients who have had parts of their articulators removed. Because of the removal of parts of the articulator, a patient's speech may be distorted and difficult to understand. To overcome this problem, VC methods can be applied to convert the distorted speech such that it is clear and more intelligible. To design an effective VC method, two key points must be considered: 1) the amount of training data may be limited (because speaking for a long time is usually difficult for postoperative patients); 2) rapid conversion is desirable (for better communication). Methods: We propose a novel joint dictionary learning based non-negative matrix factorization (JD-NMF) algorithm. Compared to conventional VC techniques, JD-NMF can perform VC efficiently and effectively with only a small amount of training data. Results: The experimental results demonstrate that the proposed JD-NMF method not only achieves notably higher short-time objective intelligibility (STOI) scores (a standardized objective intelligibility evaluation metric) than those obtained using the original unconverted speech but is also significantly more efficient and effective than a conventional exemplar-based NMF VC method. Conclusion: The proposed JD-NMF method may outperform the state-of-the-art exemplar-based NMF VC method in terms of STOI scores under the desired scenario. Significance: We confirmed the advantages of the proposed joint training criterion for the NMF-based VC. Moreover, we verified that the proposed JD-NMF can effectively improve the speech intelligibility scores of oral surgery patients.

  • Conference Article
  • Cite Count Icon 37
  • 10.1109/icassp.2004.1325907
Non-parallel training for voice conversion by maximum likelihood constrained adaptation
  • Nov 19, 2004
  • A Mouchtaris + 2 more

The objective of voice conversion methods is to modify the speech characteristics of a particular speaker in such manner, as to sound like speech by a different target speaker. Current voice conversion algorithms are based on deriving a conversion function by estimating its parameters through a corpus that contains the same utterances spoken by both speakers. Such a corpus, usually referred to as a parallel corpus, has the disadvantage that many times it is difficult or even impossible to collect. Here, we propose a voice conversion method that does not require a parallel corpus for training, i.e. the spoken utterances by the two speakers need not be the same, by employing speaker adaptation techniques to adapt to a particular pair of source and target speakers, the derived conversion parameters from a different pair of speakers. We show that adaptation reduces the error obtained when simply applying the conversion parameters of one pair of speakers to another by a factor that can reach 30% in many cases, and with performance comparable with the ideal case when a parallel corpus is available.

  • Conference Article
  • Cite Count Icon 2
  • 10.21437/interspeech.2014-295
Multimodal exemplar-based voice conversion using lip features in noisy environments
  • Sep 14, 2014
  • Kenta Masaka + 3 more

This paper presents a multimodal voice conversion (VC) method for noisy environments. In our previous exemplarbased VC method, source exemplars and target exemplars are extracted from parallel training data, in which the same texts are uttered by the source and target speakers. The input source signal is then decomposed into source exemplars, noise exemplars obtained from the input signal, and their weights. Then, the converted speech is constructed from the target exemplars and the weights related to the source exemplars. In this paper, we propose a multimodal VC method that improves the noise robustness of our previous exemplar-based VC method. As visual features, we use not only conventional DCT but also the features extracted from Active Appearance Model (AAM) applied to the lip area of a face image. Furthermore, we introduce the combination weight between audio and visual features and formulate a new cost function in order to estimate the audiovisual exemplars. By using the joint audio-visual features as source features, the VC performance is improved compared to a previous audio-input exemplar-based VC method. The effectiveness of this method was confirmed by comparing its effectiveness with that of a conventional Gaussian Mixture Model (GMM)-based method. Index Terms: voice conversion, multimodal, image features, non-negative matrix factorization, noise robustness

  • Conference Article
  • 10.1109/icassp.2018.8462569
Parallel-Data-Free Dictionary Learning for Voice Conversion Using Non-Negative Tucker Decomposition
  • Apr 1, 2018
  • Yuki Takashima + 4 more

Voice conversion (VC) is a technique where only speaker-specific information in source speech is converted while preserving the associated phonological information. Nonnegative Matrix Factorization (NMF)-based VC has been researched because of the natural-sounding voice it produces compared with conventional Gaussian Mixture Model-based VC. In conventional NMF- VC, parallel data are used to train the models; therefore, unnatural pre-processing of speech data to make parallel data is needed. NMF-VC also tends to be a large model because this method has many parallel exemplars for the dictionary matrix; therefore, the computational cost is high. In this paper, we propose a novel parallel dictionary learning method using non-negative Tucker decomposition (NTD) which uses tensor decomposition and decomposes an input observation into a set of mode matrices and one core tensor. Our proposed NTD-based dictionary learning method estimates the dictionary matrix for NMF- VC without using parallel data. Experimental results show that our proposed method outperforms conventional non-parallel VC methods.

  • Conference Article
  • Cite Count Icon 35
  • 10.1109/icassp.2014.6855137
Voice conversion based on Non-negative matrix factorization using phoneme-categorized dictionary
  • May 1, 2014
  • Ryo Aihara + 3 more

We present in this paper an exemplar-based voice conversion (VC) method using a phoneme-categorized dictionary. Sparse representation-based VC using Non-negative matrix factorization (NMF) is employed for spectral conversion between different speakers. In our previous NMF-based VC method, source exemplars and target exemplars are extracted from parallel training data, having the same texts uttered by the source and target speakers. The input source signal is represented using the source exemplars and their weights. Then, the converted speech is constructed from the target exemplars and the weights related to the source exemplars. However, this exemplar-based approach needs to hold all the training exemplars (frames), and it may cause mismatching of phonemes between input signals and selected exemplars. In this paper, in order to reduce the mismatching of phoneme alignment, we propose a phoneme-categorized sub-dictionary and a dictionary selection method using NMF. By using the sub-dictionary, the performance of VC is improved compared to a conventional NMF-based VC. The effectiveness of this method was confirmed by comparing its effectiveness with that of a conventional Gaussian Mixture Model (GMM)-based method and a conventional NMF-based method.

  • Conference Article
  • Cite Count Icon 6
  • 10.1109/icassp.2016.7472663
Semi-non-negative matrix factorization using alternating direction method of multipliers for voice conversion
  • Mar 1, 2016
  • Ryo Aihara + 2 more

Voice conversion (VC) is being widely researched in the field of speech processing because of increased interest in using such processing in applications such as personalized Text-To-Speech systems. A VC method using Non-negative Matrix Factorization (NMF) has been researched because of its natural sounding voice, however, huge memory usage and high computational times have been reported as problems. We present in this paper a new VC method using Semi-Non-negative Matrix Factorization (Semi-NMF) using the Alternating Direction Method of Multipliers (ADMM) in order to tackle the problems associated with NMF-based VC. Dictionary learning using Semi-NMF can create a compact dictionary, and ADMM enables faster convergence than conventional Semi-NMF. Experimental results show that our proposed method is 76 times faster than conventional NMF, and its conversion quality is almost the same as that of the conventional method.

  • Conference Article
  • 10.1109/eusipco.2016.7760320
3WRBM-based speech factor modeling for arbitrary-source and non-parallel voice conversion
  • Aug 1, 2016
  • Toru Nakashika + 1 more

In recent years, voice conversion (VC) becomes a popular technique since it can be applied to various speech tasks. Most existing approaches on VC must use aligned speech pairs (parallel data) of the source speaker and the target speaker in training, which makes hard to handle it. Furthermore, VC methods proposed so far require to specify the source speaker in conversion stage, even though we just want to obtain the speech of the target speaker from the other speakers in many cases of VC. In this paper, we propose a VC method where it is not necessary to use any parallel data in the training, nor to specify the source speaker in the conversion. Our approach models a joint probability of acoustic, phonetic, and speaker features using a three-way restricted Boltzmann machine (3WRBM). Speaker-independent (SI) and speaker-dependent (SD) parameters in our model are simultaneously estimated under the maximum likelihood (ML) criteria using a speech set of multiple speakers. In conversion stage, phonetic features are at first estimated in a probabilistic manner given a speech of an arbitrary speaker, then a voice-converted speech is produced using the SD parameters of the target speaker. Our experimental results showed not only that our approach outperformed other non-parallel VC methods, but that the performance of the arbitrary-source VC was close to those of the traditional source-specified VC in our approach.

  • Conference Article
  • Cite Count Icon 3
  • 10.1109/icme.2015.7177437
Sparse nonlinear representation for voice conversion
  • Jun 1, 2015
  • Toru Nakashika + 2 more

In voice conversion, sparse-representation-based methods have recently been garnering attention because they are, relatively speaking, not affected by over-fitting or over-smoothing problems. In these approaches, voice conversion is achieved by estimating a sparse vector that determines which dictionaries of the target speaker should be used, calculated from the matching of the input vector and dictionaries of the source speaker. The sparse-representation-based voice conversion methods can be broadly divided into two approaches: 1) an approach that uses raw acoustic features in the training data as parallel dictionaries, and 2) an approach that trains parallel dictionaries from the training data. In our approach, we follow the latter approach and systematically estimate the parallel dictionaries using a joint-density restricted Boltzmann machine with sparse constraints. Through voice-conversion experiments, we confirmed the high-performance of our method, comparing it with the conventional Gaussian mixture model (GMM)-based approach, and a non-negative matrix factorization (NMF)-based approach, which is based on sparse representation.

  • Conference Article
  • Cite Count Icon 11
  • 10.1109/icassp.2014.6853856
Multimodal voice conversion using non-negative matrix factorization in noisy environments
  • May 1, 2014
  • Kenta Masaka + 3 more

This paper presents a multimodal voice conversion (VC) method for noisy environments. In our previous NMF-based VC method, source exemplars and target exemplars are extracted from parallel training data, in which the same texts are uttered by the source and target speakers. The input source signal is then decomposed into source exemplars, noise exemplars obtained from the input signal, and their weights. Then, the converted speech is constructed from the target exemplars and the weights related to the source exemplars. In this paper, we propose a multimodal VC that improves the noise robustness in our NMF-based VC method. By using the joint audio-visual features as source features, the performance of VC is improved compared to a previous audio-input NMF-based VC method. The effectiveness of this method was confirmed by comparing its effectiveness with that of a conventional Gaussian Mixture Model (GMM)-based method.

  • Conference Article
  • 10.21437/interspeech.2013-323
Exemplar-based individuality-preserving voice conversion for articulation disorders in noisy environments
  • Aug 25, 2013
  • Ryo Aihara + 3 more

We present in this paper a noise robust voice conversion (VC) method for a person with an articulation disorder resulting from athetoid cerebral palsy. The movements of such speakers are limited by their athetoid symptoms, and their consonants are often unstable or unclear, which makes it difficult for them to communicate. In this paper, exemplar-based spectral conversion using Non-negative Matrix Factorization (NMF) is applied to a voice with an articulation disorder in real noisy environments. In this paper, in order to deal with background noise, an input noisy source signal is decomposed into the clean source exemplars and noise exemplars by NMF. Also, to preserve the speaker’s individuality, we use a combined dictionary that was constructed from the source speaker’s vowels and target speaker’s consonants. The effectiveness of this method was confirmed by comparing its effectiveness with that of a conventional Gaussian Mixture Model (GMM)-based method. Index Terms: Voice Conversion, NMF, Articulation Disorders, Noise Robustness, Assistive Technologies

  • Research Article
  • Cite Count Icon 8
  • 10.1109/taslp.2016.2522643
Multiple Non-Negative Matrix Factorization for Many-to-Many Voice Conversion
  • Jul 1, 2016
  • IEEE/ACM Transactions on Audio, Speech, and Language Processing
  • Ryo Aihara + 2 more

A novel voice conversion (VC) method for arbitrary speakers is proposed. Non-negative matrix factorization (NMF) has recently been applied to exemplar-based VC. It offers noise robustness and naturalness of the converted voice, compared with widely used Gaussian mixture model-based VC. However, because NMF-based VC requires parallel training data from source and target speakers, the voice of arbitrary speakers cannot be converted in this framework. In this study, we propose the multiple non-negative matrix factorization (Multi-NMF) to allow the implementation of many-to-many, exemplar-based VC. Our experimental results demonstrate that the conversion quality of the proposed method is close to that of conventional one-to-one VC, even though the proposed method requires neither the source speakers' spectra, nor the target speakers' spectra, to be included in the training set.

  • Conference Article
  • Cite Count Icon 5
  • 10.1109/iscslp.2016.7918382
Dictionary update for NMF-based voice conversion using an encoder-decoder network
  • Oct 1, 2016
  • Chin-Cheng Hsu + 4 more

In this paper, we propose a dictionary update method for Non-negative Matrix Factorization (NMF) with high dimensional data in a spectral conversion (SC) task. Voice conversion has been widely studied due to its potential applications such as personalized speech synthesis and speech enhancement. Exemplar-based NMF (ENMF) emerges as an effective and probably the simplest choice among all techniques for SC, as long as a source-target parallel speech corpus is given. ENMF-based SC systems usually need a large amount of bases (exemplars) to ensure the quality of the converted speech. However, a small and effective dictionary is desirable but hard to obtain via dictionary update, in particular when high-dimensional features such as STRAIGHT spectra are used. Therefore, we propose a dictionary update framework for NMF by means of an encoder-decoder reformulation. Regarding NMF as an encoder-decoder network makes it possible to exploit the whole parallel corpus more effectively and efficiently when applied to SC. Our experiments demonstrate significant gains of the proposed system with small dictionaries over conventional ENMF-based systems with dictionaries of same or much larger size.

  • Conference Article
  • Cite Count Icon 3
  • 10.1109/waspaa.2015.7336943
Many-to-one voice conversion using exemplar-based sparse representation
  • Oct 1, 2015
  • Ryo Aihara + 2 more

Voice conversion (VC) is being widely researched in the field of speech processing because of increased interest in using such processing in applications such as personalized Text-to-Speech systems. We present in this paper a many-to-one VC method using exemplar-based sparse representation, which is different from conventional statistical VC. In our previous exemplar-based VC method, input speech was represented by the source dictionary and its sparse coefficients. The source and the target dictionaries are fully coupled and the converted voice is constructed from the source coefficients and the target dictionary. This method requires parallel exemplars (which consist of the source exemplars and target exemplars that have the same texts uttered by the source and target speakers) for dictionary construction. In this paper, we propose a many-to-one VC method in an exemplar-based framework which does not need training data of the source speaker. Some statistical approaches for many-to-one VC have been proposed; however, in the framework of exemplar-based VC, such a method has never been proposed. The effectiveness of our many-to-one VC has been confirmed by comparing its effectiveness with that of a conventional one-to-one NMF-based method and one-to-one GMM-based method.

Save Icon
Up Arrow
Open/Close
  • Ask R Discovery Star icon
  • Chat PDF Star icon

AI summaries and top papers from 250M+ research sources.