Source Speaker Research Articles

Research on automatic monitoring of emotional state from speech opens the way to novel applications for the remote monitoring of common mental disorders such as depression. However, these tools raise privacy concerns, since speech is sent over the telephone or the Internet and is then stored or processed on remote servers. Speaker de-identification can protect the privacy of these patients, but it might impair the ability of automatic depression detection approaches to perceive the disease. The de-identified speech must also be of sufficient quality, since practitioners may need to listen to the recordings to assess the patients’ state. This paper performs an extensive analysis of depression detection from de-identified speech using different de-identification approaches based on voice conversion. In previous work, a de-identification technique based on pretrained transformation functions was assessed in the context of depression detection. That strategy is speaker-independent (i.e. not speaker-specific) and gender-independent (i.e. the gender of the speaker is not necessarily preserved), so it can be deployed in a real-world scenario because it requires no parallel training data between input and source speakers. This paper analyzes different aspects of that speaker de-identification approach in a depression detection scenario: 1) it compares the performance of the speaker-independent technique with a speaker-dependent setting where parallel data between input and source speaker are available; 2) it analyzes how the system behaves when the gender of the speaker is preserved, since this might be a desirable feature and has not been addressed in previous work; 3) it assesses the performance of two voice conversion methods when only a limited amount of training data is available, comparing de-identification based on frequency warping and amplitude scaling (FW+AS) with a strategy based on generative adversarial networks (GAN). Experimental validation was carried out in the framework of the Audio/Visual Emotion Challenge 2014. The results suggest that speaker-independent and gender-dependent de-identification is the most suitable option for depression level estimation, since its trade-off between de-identification and depression estimation performance was superior to that of the other alternatives. In addition, the results suggest that the GAN-based de-identification approach achieves better de-identification performance than FW+AS while achieving comparable results for depression detection.

Voice Conversion (VC) converts a source speaker's speech into a target speaker's speech without changing its content. Current VC methods have the following problems: (1) they apply only to a limited number of speakers rather than to arbitrary speakers, which greatly restricts their application scenarios; (2) the representation (feature) separation (RS) achieved by current mainstream techniques on the source speaker speech and the target speaker speech is not ideal; and (3) the conversion quality of most models is unsatisfactory and needs to be improved. Therefore, in this paper, we construct a one-shot VC model based on representation separation, called the RS-VC model, implemented with an encoder-decoder structure. The encoder is composed of a content encoder and a speaker encoder. The content encoder separates the content information from the source speaker's voice and generates a content representation. The speaker encoder separates the speaker information from the target speaker's voice and generates a speaker representation. The decoder synthesizes the converted voice from the content representation and the speaker representation. We obtain the optimized speaker verification model SVINGE2E (Speaker Verification with Instance Normalization using Generalized End-to-End loss) by improving the basic speaker verification (SV) model, and use SVINGE2E as the speaker encoder. This speaker encoder is trained before the RS-VC model; the pre-trained SVINGE2E model directly extracts the speaker representation from the target speaker's voice and is used for training and testing the RS-VC model. A progressive training method is then proposed for training the RS-VC model. Experiments show that the progressive training method effectively improves the quality of the converted voice. Compared with the basic speaker verification model, both SVINGE2E and RS-VC deliver impressive improvements in EER (Equal Error Rate).
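
For readers unfamiliar with the metric, the EER mentioned above is the operating point at which the false acceptance rate equals the false rejection rate of a speaker verification system. The following is a minimal sketch of how it can be approximated from trial scores, not the evaluation code of the paper. The function name `equal_error_rate`, the simple threshold sweep, and the example scores are assumptions made for illustration only.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    Hypothetical helper for illustration: higher scores are assumed to mean
    "more likely the same speaker". A trial is accepted when score >= threshold.
    """
    if not genuine_scores or not impostor_scores:
        raise ValueError("both score lists must be non-empty")

    thresholds = sorted(set(genuine_scores) | set(impostor_scores))
    best_gap = float("inf")
    eer = None
    for t in thresholds:
        # False rejection: a genuine (same-speaker) trial scored below t.
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        # False acceptance: an impostor trial scored at or above t.
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest
            # and report their average as the EER estimate.
            best_gap = gap
            eer = (far + frr) / 2
    return eer


# Made-up scores, purely illustrative (not data from the paper).
genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(genuine, impostor))  # 0.25
```

A lower EER indicates that genuine and impostor trials are better separated. With finite score lists the two error rates rarely cross exactly, which is why this sketch averages them at the closest threshold.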

Articles published on Source Speaker

CBFMCycleGAN-VC: Using the Improved MaskCycleGAN-VC to Effectively Predict a Person’s Voice After Aging

Cross-Speaker Emotion Disentangling and Transfer for End-to-End Speech Synthesis

U2-VC: one-shot voice conversion using two-level nested U-structure

Data augmentation based non-parallel voice conversion with frame-level speaker disentangler

Effects of Sinusoidal Model on Non-Parallel Voice Conversion with Adversarial Learning

Voice Transformation Using Two-Level Dynamic Warping and Neural Networks

Connectionist temporal classification loss for vector quantized variational autoencoder in zero-shot voice conversion

Converting Foreign Accent Speech Without a Reference

Morphological generalization of Hebrew verb classes

When Automatic Voice Disguise Meets Automatic Speaker Verification

Analysis of gender and identity issues in depression detection on de-identified speech

Vowels and Prosody Contribution in Neural Network Based Voice Conversion Algorithm with Noisy Training Data

One-Shot Voice Conversion Algorithm Based on Representations Separation

Effective Emotion Transplantation in an End-to-End Text-to-Speech System

Multi-Task WaveRNN With an Integrated Architecture for Cross-Lingual Voice Conversion

NAUTILUS: A Versatile Voice Cloning System

Speaker Anonymization for Personal Information Protection Using Voice Conversion Techniques

ConvS2S-VC: Fully Convolutional Sequence-to-Sequence Voice Conversion

Voice Conversion for Persons with Amyotrophic Lateral Sclerosis
