Abstract

We propose a hybrid network-based learning framework for speaker-adaptive vocal emotion conversion, tested on three datasets in three languages: EmoDB (German), IITKGP (Telugu), and SAVEE (English). The optimized learning model is unique in its ability to synthesize emotional speech of acceptable perceptual quality while preserving speaker characteristics. The multilingual model is particularly beneficial in scenarios wherein emotional training data from a specific target speaker are sparse. The proposed model uses speaker-normalized mel-generalized cepstral coefficients for spectral training, with data adaptation using seed data from the target speaker. The fundamental frequency (F0) is transformed using a wavelet synchrosqueezed transform prior to mapping to obtain a sharpened time-frequency representation, and a feedforward artificial neural network combined with particle swarm optimization is used for F0 training. Additionally, static intensity modification is performed for each test utterance. Using this framework, we captured the spectral and pitch-contour variabilities of emotional expression better than the other state-of-the-art methods considered in this study. Across datasets, the proposed framework achieved an average mel-cepstral distortion (MCD) of 4.98 and root mean square error of F0 (RMSE-F0) of 10.67 in objective evaluations, and an average comparative mean opinion score (CMOS) of 3.57 and speaker similarity score of 3.70 in subjective evaluations. In particular, the best MCD of 4.09 (EmoDB, happiness) and RMSE-F0 of 9.00 (EmoDB, anger) were obtained, along with a maximum CMOS of 3.7 and speaker similarity of 4.6, highlighting the effectiveness of the hybrid network model.
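
The sketch below illustrates the analysis front end the abstract describes: mel-generalized cepstral coefficients for the spectrum and a synchrosqueezed time-frequency representation of the F0 contour. This is a minimal sketch under assumptions, not the authors' implementation; it assumes the pyworld, pysptk, and ssqueezepy packages, and all analysis parameters (order, alpha, gamma, itype) are illustrative only.

```python
# Minimal front-end sketch (assumed libraries; parameters are illustrative).
import numpy as np
import pyworld as pw            # WORLD vocoder analysis (assumed choice)
import pysptk                   # SPTK bindings for mgcep (assumed choice)
from ssqueezepy import ssq_cwt  # synchrosqueezed CWT (assumed choice)

def analyze(x, fs, order=24, alpha=0.42, gamma=-1/3):
    x = x.astype(np.float64)
    # F0 contour and spectral envelope from the WORLD analyzer
    f0, t = pw.harvest(x, fs)
    sp = pw.cheaptrick(x, f0, t, fs)
    # Mel-generalized cepstral coefficients per frame (spectral features);
    # itype=4 treats each row of sp as a power spectrum
    mgc = np.apply_along_axis(
        pysptk.mgcep, 1, sp, order=order, alpha=alpha, gamma=gamma, itype=4)
    # Interpolate unvoiced gaps, then sharpen the F0 contour's
    # time-frequency representation with a synchrosqueezed CWT
    f0_interp = np.where(f0 > 0, f0, np.interp(
        np.arange(len(f0)), np.flatnonzero(f0), f0[f0 > 0]))
    Tx, Wx, ssq_freqs, scales = ssq_cwt(f0_interp)
    return mgc, f0, Tx
```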

Highlights

  • Emotions form a salient aspect of human communication via various modalities

  • This study aims to address key challenges in emotion conversion by proposing an adaptation model; the model was tested on three languages: German, Telugu, and English

  • Multi-layer perceptron (MLP) deep neural networks (DNNs) are typical deep-learning models used for function approximation, with deeper architectures and regularization compared with shallow artificial neural networks (ANNs); see the sketch after this list
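
To make the last highlight concrete, the sketch below fits a small feedforward network with particle swarm optimization (PSO) instead of backpropagation, in the spirit of the ANN-plus-PSO F0 training the abstract mentions. Everything here (layer sizes, swarm hyperparameters, the toy target function) is an assumption for illustration, not the paper's configuration.

```python
# Illustrative PSO training of a one-hidden-layer feedforward network.
import numpy as np

rng = np.random.default_rng(0)

def unpack(theta, n_in, n_hid):
    """Slice a flat parameter vector into weights and biases."""
    i = 0
    W1 = theta[i:i + n_in * n_hid].reshape(n_in, n_hid); i += n_in * n_hid
    b1 = theta[i:i + n_hid]; i += n_hid
    W2 = theta[i:i + n_hid].reshape(n_hid, 1); i += n_hid
    b2 = theta[i:i + 1]
    return W1, b1, W2, b2

def forward(theta, X, n_hid=8):
    W1, b1, W2, b2 = unpack(theta, X.shape[1], n_hid)
    h = np.tanh(X @ W1 + b1)          # hidden layer
    return (h @ W2 + b2).ravel()      # linear output

def mse(theta, X, y):
    return np.mean((forward(theta, X) - y) ** 2)

def pso(X, y, dim, n_particles=30, iters=200, w=0.7, c1=1.5, c2=1.5):
    """Standard global-best PSO over the flattened network weights."""
    pos = rng.normal(scale=0.5, size=(n_particles, dim))
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_f = np.array([mse(p, X, y) for p in pos])
    gbest = pbest[np.argmin(pbest_f)].copy()
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos += vel
        f = np.array([mse(p, X, y) for p in pos])
        better = f < pbest_f
        pbest[better], pbest_f[better] = pos[better], f[better]
        gbest = pbest[np.argmin(pbest_f)].copy()
    return gbest

# Toy usage: learn a nonlinear mapping (stand-in for an F0 transformation).
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X).ravel()
n_hid = 8
dim = 1 * n_hid + n_hid + n_hid + 1   # W1 + b1 + W2 + b2
theta = pso(X, y, dim)
print("final MSE:", mse(theta, X, y))
```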


Summary

INTRODUCTION

Emotions form a salient aspect of human communication across various modalities. Among these, speech is the most accessible data source, as it carries cues such as linguistic information, the gender and identity of the speaker, and emotion. An emotion-synthesis framework essentially modifies the spectral and prosodic components extracted from neutral speech to those of the target emotion via parameter learning, generally using aligned parallel data. Modern emotion-conversion schemes are generally data driven, and the speech features must be trained using machine-learning approaches. Continuous control of emotional intensity has been accomplished by combining a three-layer adaptive neuro-fuzzy inference system with a Fujisaki model for extracting the F0 contour [53]; that method was tested on monolingual Japanese data. Efforts have also been invested in integrating an emotion-conversion module into convolutional neural network based TTS to improve its naturalness. All these models are data intensive and introduce linguistic dependencies. In contrast, the proposed framework, which uses sparse training data from the target speaker for mapping, achieves appreciable perceptual quality in the synthesized speech. The following section describes the mathematical theory applied in the proposed framework.
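
Before turning to that theory, here is a minimal illustration of the "aligned parallel data" step mentioned above: time-aligning a neutral and an emotional feature sequence with plain dynamic time warping (DTW) before forming source-target training pairs. The DTW variant, the feature dimensions, and the random stand-in features are assumptions for illustration; the paper's exact alignment procedure may differ.

```python
# Minimal Euclidean DTW alignment of parallel utterances (illustrative).
import numpy as np
from scipy.spatial.distance import cdist

def dtw_path(src, tgt):
    """Return index pairs aligning src frames to tgt frames."""
    cost = cdist(src, tgt)                      # (n, m) local distances
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)       # accumulated-cost matrix
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    # Backtrack from (n, m) to recover the warping path
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# Toy parallel features: 40 neutral frames vs. 55 emotional frames, 24 dims.
rng = np.random.default_rng(1)
neutral, emotional = rng.normal(size=(40, 24)), rng.normal(size=(55, 24))
pairs = dtw_path(neutral, emotional)
X = neutral[[i for i, _ in pairs]]    # aligned source frames
Y = emotional[[j for _, j in pairs]]  # aligned target frames (training pairs)
```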

MATHEMATICAL BACKGROUND
PARTICLE SWARM OPTIMIZATION
TESTING AND RE-SYNTHESIS
EXPERIMENTAL DATA SOURCE
RESULTS AND DISCUSSION
SUBJECTIVE EVALUATION AND ANALYSIS
CONCLUSION AND FUTURE SCOPE