Voice Conversion Model Research Articles

<abstract><p>Singing voice conversion methods struggle to balance synthesis quality against singer similarity. Traditional voice conversion techniques primarily emphasize singer similarity, often producing robotic-sounding singing voices. Deep learning-based singing voice conversion techniques instead focus on disentangling singer-dependent and singer-independent features. While this approach can improve the quality of synthesized singing voices, many voice conversion systems still suffer from singer-dependent features leaking into the content embeddings. The proposed singing voice conversion technique implements an encoder-decoder framework using a hybrid of a convolutional neural network (CNN) and long short-term memory (LSTM). This paper investigated activation guidance and adaptive instance normalization for one-shot singing voice conversion. The instance normalization (IN) layers within the auto-encoder separated singer and content representations. During conversion, singer representations were transferred using adaptive instance normalization (AdaIN) layers. With the help of the activation function, the system prevented the transfer of singer information while conveying the singing content. Additionally, fusing an LSTM with a CNN can improve voice conversion models by capturing both local and contextual features. The one-shot capability simplified the architecture, which uses a single encoder and decoder. Objective and subjective evaluations showed that the proposed hybrid CNN-LSTM model outperformed the baseline architectures without compromising either quality or similarity, with a mean opinion score (MOS) of 2.93 for naturalness and 3.35 for melodic similarity. The hybrid CNN-LSTM approach performed high-quality voice conversion with minimal training data, making it a promising solution for various applications.</p></abstract>
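The IN and AdaIN steps named in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes features are stored as a list of channels, each a list of per-frame values, and the names `instance_norm`, `adain`, and `EPS` are hypothetical. It shows only the general idea that instance normalization removes per-channel statistics from the content features and AdaIN re-applies the statistics of a target singer's features.

```python
from statistics import fmean, pstdev

EPS = 1e-5  # assumed small constant to avoid division by zero


def instance_norm(channel):
    """Normalize one channel over time to roughly zero mean and unit variance."""
    mu = fmean(channel)
    sigma = pstdev(channel)
    return [(x - mu) / (sigma + EPS) for x in channel]


def adain(content, style):
    """Re-scale each normalized content channel with the style channel's statistics.

    content, style: lists of channels; each channel is a list of frame values.
    Both layouts are assumptions made for this illustration.
    """
    out = []
    for c_chan, s_chan in zip(content, style):
        s_mu, s_sigma = fmean(s_chan), pstdev(s_chan)
        out.append([s_sigma * x + s_mu for x in instance_norm(c_chan)])
    return out


if __name__ == "__main__":
    content = [[1.0, 2.0, 3.0, 4.0]]      # toy content features, one channel
    style = [[10.0, 20.0, 30.0, 40.0]]    # toy target-singer features
    print(adain(content, style))
```

After the call, each output channel keeps the temporal shape of the content channel but carries the mean and standard deviation of the corresponding style channel.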


Emotional voice conversion (EVC) is the task of converting an utterance’s emotional features into those of a target emotion while retaining semantic information and speaker identity. Recently, researchers have leveraged deep learning methods to improve the performance of EVC, such as deep neural networks (DNN), sequence-to-sequence models (seq2seq), long short-term memory networks (LSTM), and convolutional neural networks (CNN), as well as their combinations with an attention mechanism. However, these methods often suffer from instability problems (e.g., mispronunciations and skipped phonemes) because the models fail to capture temporal intra-relationships across a wide range of frames, resulting in unnatural speech and discontinuous emotional expression. To better capture intra-relations among frames by augmenting the temporal dependency of the model, we explored the power of a transformer in this study. Specifically, we proposed a CycleGAN-based model with a transformer and investigated its ability in the EVC task. During training, we adopted curriculum learning to gradually increase the frame length, so that the model first attends to short segments and eventually to the entire utterance. The proposed method was evaluated on a Japanese emotional speech dataset and the Emotional Speech Dataset (ESD, containing English and Chinese speech). It was then compared with widely used EVC baselines (ACVAE, CycleGAN) in objective and subjective evaluations. The results indicate that our proposed model can convert emotion with higher emotional similarity, quality, and naturalness.
• A CycleTransGAN is proposed to improve performance on the emotional voice conversion (EVC) task.
• Curriculum learning was adopted to gradually increase the input length during training.
• A fine-grained level discriminator was designed to enhance the model’s ability to convert emotional voices.
• The proposed method was evaluated on a Japanese emotional speech dataset and the Emotional Speech Dataset (ESD, containing English and Chinese speech).
• The transformer extended the model’s temporal dependency over a wider range of frames, which improved the quality of converted speech.
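The curriculum on input length mentioned in the abstract could be sketched as a simple schedule. The abstract does not give the actual schedule, so everything here is an assumption: the linear growth, the default lengths of 64 and 512 frames, and the helper names `frame_length_for_epoch` and `crop_segment` are hypothetical and chosen only to illustrate the idea of training on short segments first and longer ones later.

```python
def frame_length_for_epoch(epoch, total_epochs, min_frames=64, max_frames=512):
    """Return the training segment length (in frames) for a given epoch.

    Grows linearly from min_frames to max_frames; both defaults and the
    linear shape are illustrative assumptions, not values from the paper.
    """
    if total_epochs <= 1:
        return max_frames
    progress = min(max(epoch / (total_epochs - 1), 0.0), 1.0)
    return int(round(min_frames + progress * (max_frames - min_frames)))


def crop_segment(frames, length, start=0):
    """Cut a training segment of at most `length` frames from an utterance."""
    return frames[start:start + length]


if __name__ == "__main__":
    utterance = list(range(600))  # stand-in for a sequence of acoustic frames
    for epoch in range(5):
        n = frame_length_for_epoch(epoch, total_epochs=5)
        segment = crop_segment(utterance, n)
        print(epoch, len(segment))
```

In a real training loop the start offset would typically be sampled at random per utterance; it is fixed here to keep the sketch deterministic.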



Articles published on Voice Conversion Model

CLESSR-VC: Contrastive learning enhanced self-supervised representations for one-shot voice conversion

SVCGAN: Speaker Voice Conversion Generative Adversarial Network for Children’s Speech Conversion and Recognition

DDDM-VC: Decoupled Denoising Diffusion Models with Disentangled Representation and Prior Mixup for Verified Robust Voice Conversion

Phoneme Hallucinator: One-Shot Voice Conversion via Set Expansion

A noise-robust voice conversion method with controllable background sounds

Noise-robust voice conversion using adversarial training with multi-feature decoupling

A hybrid CNN-LSTM model with adaptive instance normalization for one shot singing voice conversion

Speaker embedding space cosine similarity comparisons of singing voice conversion models and voice morphing

Disentangling Content Information by Combining ASR and TTS Bottleneck Features for Voice Conversion

StyleTTS-VC: One-Shot Voice Conversion by Knowledge Transfer from Style-Based TTS Models

Decoupling Speaker-Independent Emotions for Voice Conversion via Source-Filter Networks

GLGAN-VC: A Guided Loss-Based Generative Adversarial Network for Many-to-Many Voice Conversion

Non-Parallel Whisper-to-Normal Speaking Style Conversion Using Auxiliary Classifier Variational Autoencoder

Electrolaryngeal speech enhancement based on a two stage framework with bottleneck feature refinement and voice conversion

An improved CycleGAN-based emotional voice conversion model by augmenting temporal dependency with a transformer

Parallel voice conversion with limited training data using stochastic variational deep kernel learning

One-shot emotional voice conversion based on feature separation

Cross-lingual multi-speaker speech synthesis with limited bilingual training data

A Preliminary Study on Realizing Human-Robot Mental Comforting Dialogue via Sharing Experience Emotionally

Connectionist temporal classification loss for vector quantized variational autoencoder in zero-shot voice conversion

