Abstract
The performance of Speech Emotion Recognition (SER) systems degrades when training and test conditions or corpora differ. Cross-corpus SER (CCSER) is a research branch that studies how to adapt an SER system to recognize speech emotions in a corpus whose recording conditions or language differ from those of the training corpus. In CCSER, adaptation can be performed in the feature extraction module or in the emotion classifier, the two main components of an SER system. In this paper, we propose the AFTL method (attention-based feature fusion along with transfer learning), which includes methods for both feature extraction and classification in CCSER. In the feature extraction part, we use Wav2Vec 2.0 transformer blocks and prosody features, and we propose an attention method for fusing them. In the classifier part, we use transfer learning to transfer the knowledge of a model trained on a source emotional speech corpus to recognize emotions in a target corpus. We performed experiments on numerous emotional speech datasets as target corpora, with IEMOCAP as the source corpus. For instance, we achieve 92.45% accuracy on the EmoDB dataset while using only 20% of its speakers to adapt the source model. In addition, we obtained acceptable results on the other target corpora.
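To make the idea of attention-based feature fusion concrete, the following is a minimal sketch of one generic form of it: each feature stream is scored against a query vector, the scores are normalized with a softmax, and the streams are combined as a weighted sum. This is not the AFTL architecture itself, whose details are not given in the abstract. The function names, the `query` vector (which would normally be learned), and the assumption that both streams have already been projected to the same dimension are all illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(streams, query):
    """Fuse equal-length feature vectors with softmax attention weights.

    streams: list of feature vectors, e.g. one standing in for a Wav2Vec 2.0
             embedding and one for prosody features (hypothetical inputs,
             assumed already projected to the same dimension).
    query:   scoring vector of the same dimension (hypothetical; learned
             during training in a real model).
    Returns the fused vector and the attention weight of each stream.
    """
    # Dot-product score of each stream against the query.
    scores = [sum(q * x for q, x in zip(query, vec)) for vec in streams]
    weights = softmax(scores)
    # Weighted sum of the streams, computed per dimension.
    fused = [
        sum(w * vec[i] for w, vec in zip(weights, streams))
        for i in range(len(query))
    ]
    return fused, weights


if __name__ == "__main__":
    # Toy 2-dimensional stand-ins for the two feature streams.
    wav2vec_like = [1.0, 0.0]
    prosody_like = [0.0, 1.0]
    fused, weights = attention_fuse([wav2vec_like, prosody_like], [0.0, 0.0])
    print(fused, weights)
```

With an all-zero query, both streams receive equal weight; a query aligned with one stream shifts the weight toward it, which is the behavior a learned attention mechanism exploits.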