Voice Conversion Research Articles

<abstract><p>Singing voice conversion methods struggle to balance synthesis quality against singer similarity. Traditional voice conversion techniques primarily emphasize singer similarity, which often leads to robotic-sounding singing voices. Deep learning-based singing voice conversion techniques instead focus on disentangling singer-dependent and singer-independent features. While this approach can improve the quality of synthesized singing voices, many voice conversion systems still suffer from singer-dependent features leaking into the content embeddings. The proposed singing voice conversion technique implements an encoder-decoder framework using a hybrid model that combines a convolutional neural network (CNN) with long short-term memory (LSTM). This paper investigates activation guidance and adaptive instance normalization for one-shot singing voice conversion. The instance normalization (IN) layers within the auto-encoder separated singer and content representations. During conversion, singer representations were transferred using adaptive instance normalization (AdaIN) layers. With the help of the activation function, the system prevented the transfer of singer information while conveying the singing content. Additionally, fusing LSTM with CNN lets the voice conversion model capture both local and contextual features. The one-shot capability simplified the architecture, which uses a single encoder and decoder. The proposed hybrid CNN-LSTM model maintained both quality and similarity: objective and subjective evaluations showed that it outperformed the baseline architectures, with a mean opinion score (MOS) of 2.93 for naturalness and 3.35 for melodic similarity. These hybrid CNN-LSTM techniques allowed high-quality voice conversion with minimal training data, making the model a promising solution for various applications.</p></abstract>
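
The IN and AdaIN steps described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the `eps` value, and the toy feature values are assumptions made for illustration, and the singer statistics are taken directly from target features here, whereas a real system would typically obtain them from a learned singer encoder.

```python
from statistics import fmean, pstdev


def channel_stats(channel):
    """Return the (mean, standard deviation) of one feature channel over time."""
    return fmean(channel), pstdev(channel)


def instance_norm(features, eps=1e-5):
    """Normalize each channel to zero mean and (near) unit variance.

    Removing per-channel statistics is what strips global, singer-like
    information from the content representation.
    """
    normalized = []
    for channel in features:
        mean, std = channel_stats(channel)
        normalized.append([(v - mean) / (std + eps) for v in channel])
    return normalized


def adain(content, singer_stats, eps=1e-5):
    """Re-scale normalized content with the target singer's statistics."""
    normalized = instance_norm(content, eps)
    return [
        [v * std + mean for v in channel]
        for channel, (mean, std) in zip(normalized, singer_stats)
    ]


# Hypothetical toy features: 2 channels x 4 time frames.
content = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.5, 9.5, 10.0]]
target = [[0.0, 0.2, -0.2, 0.0], [5.0, 7.0, 3.0, 5.0]]

# Assumption: singer statistics come straight from the target features.
singer_stats = [channel_stats(ch) for ch in target]
converted = adain(content, singer_stats)
```

After the call, each channel of `converted` keeps the temporal shape of `content` but carries the mean and standard deviation of the corresponding `target` channel.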

The Multilingual Voice-Based Image Caption Generator (MVBICG) is a versatile tool with applications spanning communications, cultural preservation, business, and technology. Image caption generation combines computer vision and natural language processing (NLP), enabling a system to understand the details of an image's context and describe them in natural language. Image descriptions are especially valuable for visually impaired individuals. The MVBICG system is designed to provide real-time image descriptions as voice output in multiple languages, according to user requirements. Converting a voice into multiple languages with the help of the Google Translate API is often referred to as “multilingual voice conversion” or “multilingual speech synthesis.” The system leverages recent advances in deep learning, particularly convolutional neural networks (CNNs) for image feature extraction and recurrent neural networks (RNNs) with attention mechanisms for natural language generation. The MVBICG achieves BLEU scores of 0.483601 for BLEU-1 and 0.320112 for BLEU-2, indicating that it generates precise and contextually relevant image captions. These scores underscore its value in bridging language barriers and enhancing accessibility. Additionally, the system's training progress is illustrated by a loss plot showing the convergence of the model over time. As image processing continues to advance, it is expected to become a critical research domain dedicated to the preservation and protection of human lives.
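
For readers unfamiliar with the BLEU-1 and BLEU-2 figures quoted above, the following is a minimal sketch of how cumulative sentence-level BLEU is commonly computed against a single reference. The abstract does not say which BLEU implementation or smoothing was used, so the helper names and the example sentences below are hypothetical, and this unsmoothed variant will not reproduce the reported scores.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it appears in the reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def bleu(candidate, reference, max_n):
    """Cumulative BLEU up to max_n with uniform weights and brevity penalty."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # no smoothing in this sketch
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_mean)


# Hypothetical caption pair, tokenized on whitespace.
reference = "a dog runs across the grass".split()
candidate = "a dog runs on the grass".split()

bleu1 = bleu(candidate, reference, max_n=1)
bleu2 = bleu(candidate, reference, max_n=2)
```

BLEU-2 is never higher than BLEU-1 for the same caption pair when both are computed this way, because it folds in the bigram precision, which is usually lower than the unigram precision; the reported 0.32 versus 0.48 follows the same pattern.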

Articles published on Voice Conversion

A hybrid CNN-LSTM model with adaptive instance normalization for one shot singing voice conversion

Active Defense Against Voice Conversion Through Generative Adversarial Network

Enhancing Cross-Linguistic Image Caption Generation with Indian Multilingual Voice Interfaces using Deep Learning Techniques

MetricCycleGAN-VC: Forcing CycleGAN-Based Voice Conversion Systems to Associate With Objective Quality Metrics

A Comparative Evaluation on Data Transformation Approach for Artificial Speech Detection

RefXVC: Cross-Lingual Voice Conversion with Enhanced Reference Leveraging

A Pitch-Controlled End-to-End Voice Conversion System for Brazilian Portuguese

Fast end-to-end non-parallel voice conversion based on speaker-adaptive neural vocoder with cycle-consistent learning

A novel communication system based on sign language recognition and voice conversion for differently abled person

StreamVoice+: Evolving Into End-to-End Streaming Zero-Shot Voice Conversion

Sign Recognition and Voice Conversion Device for Dumb

Complementary regional energy features for spoofed speech detection

Method of interacting between humans and conversational voice agent systems

Robust Spoofed Speech Detection with Denoised I-vectors

Choosing only the best voice imitators: Top-K many-to-many voice conversion with StarGAN

Huawei Smartphone Marketing Case Analysis

Any-to-One Non-Parallel Voice Conversion System Using an Autoregressive Conversion Model and LPCNet Vocoder

W2VC: WavLM representation based one-shot voice conversion with gradient reversal distillation and CTC supervision

Data Augmentation based Cross-Lingual Multi-Speaker TTS using DL with Sentiment Analysis

Voice spoofing detection for multiclass attack classification using deep learning
