Abstract

Speech emotion recognition (SER) classifies emotions using low-level features or a spectrogram of an utterance. When SER methods are trained and tested on different datasets, their performance degrades. Cross-corpus SER research therefore identifies speech emotions across different corpora and languages, and recent work has focused on improving generalization. To improve cross-corpus SER performance, we pretrained our visual attention convolutional neural network (VACNN), a 2D CNN base model with channel- and spatial-wise visual attention modules, on the log-mel spectrograms of the source dataset. When fine-tuning on the target dataset, we extracted a feature vector using a bag of visual words (BOVW) to assist the fine-tuned model. Because visual words represent local features of an image, the BOVW helps the VACNN learn both global and local features of the log-mel spectrogram by constructing a frequency histogram of visual words. The proposed method achieves an overall accuracy of 83.33%, 86.92%, and 75.00% on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), the Berlin Database of Emotional Speech (EmoDB), and Surrey Audio-Visual Expressed Emotion (SAVEE), respectively. Experimental results on RAVDESS, EmoDB, and SAVEE demonstrate improvements of 7.73%, 15.12%, and 2.34%, respectively, over existing state-of-the-art cross-corpus SER approaches.
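
For illustration, the following is a minimal sketch of how such a BOVW feature vector could be built from a log-mel spectrogram, assuming ORB keypoint descriptors and a k-means codebook; the descriptor type, codebook size, and helper names are illustrative choices, not the paper's exact configuration.

```python
# Illustrative bag-of-visual-words (BOVW) pipeline for a log-mel spectrogram.
# Descriptor choice (ORB), codebook size, and function names are assumptions,
# not the paper's exact configuration.
import cv2
import librosa
import numpy as np
from sklearn.cluster import KMeans

def log_mel_image(path, n_mels=128):
    """Load an utterance and convert it to an 8-bit log-mel spectrogram image."""
    y, sr = librosa.load(path, sr=None)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)
    img = cv2.normalize(log_mel, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    return img

def local_descriptors(img):
    """Extract local keypoint descriptors (ORB here) from the spectrogram image."""
    orb = cv2.ORB_create()
    _, desc = orb.detectAndCompute(img, None)
    return desc if desc is not None else np.empty((0, 32), dtype=np.uint8)

def build_codebook(descriptor_list, k=64):
    """Cluster all training descriptors into k visual words."""
    stacked = np.vstack(descriptor_list).astype(np.float32)
    return KMeans(n_clusters=k, random_state=0).fit(stacked)

def bovw_histogram(desc, codebook):
    """Quantize descriptors to visual words and return a normalized frequency histogram."""
    k = codebook.n_clusters
    if len(desc) == 0:
        return np.zeros(k, dtype=np.float32)
    words = codebook.predict(desc.astype(np.float32))
    hist, _ = np.histogram(words, bins=np.arange(k + 1))
    return hist / hist.sum()
```

The resulting histogram is a fixed-length vector per utterance, which can be concatenated with (or fed alongside) the CNN representation during fine-tuning.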

Highlights

  • Emotion recognition (ER) plays an important role in human-computer interaction (HCI) [1]. During the last few years, numerous approaches have been proposed using different modalities [2,3,4,5]

  • There are four primary types of acoustic features that can be extracted from speech signals [6]: (i) Continuous features such as pitch and energy; (ii) features related to aspects of voice quality; (iii) spectral features, such as linear predictive coefficients, mel-frequency cepstral coefficient (MFCC), and log-frequency power coefficient; and (iv) Teager Energy Operator (TEO)-based features, such as the normalized TEO autocorrelation envelope area

  • A group normalization layer is adopted in the visual attention convolutional neural network (VACNN) to avoid overfitting (see the sketch after this list)
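
As referenced above, the following is a minimal PyTorch sketch of a convolutional block that combines group normalization with channel- and spatial-wise attention in the spirit of the described VACNN; the layer sizes, reduction ratio, and number of groups are assumptions, not the authors' exact architecture.

```python
# Illustrative PyTorch block: 2D convolution + group normalization followed by
# channel- and spatial-wise attention. Layer sizes, the reduction ratio, and the
# number of groups are assumptions, not the paper's exact VACNN.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        # Squeeze spatial dimensions with average and max pooling, then reweight channels.
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        scale = torch.sigmoid(avg + mx).unsqueeze(-1).unsqueeze(-1)
        return x * scale

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        # Pool across channels and learn a 2D attention map over time-frequency bins.
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(pooled))

class VACNNBlock(nn.Module):
    def __init__(self, in_ch, out_ch, groups=8):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.norm = nn.GroupNorm(groups, out_ch)   # group normalization to curb overfitting
        self.act = nn.ReLU(inplace=True)
        self.channel_att = ChannelAttention(out_ch)
        self.spatial_att = SpatialAttention()

    def forward(self, x):
        x = self.act(self.norm(self.conv(x)))
        return self.spatial_att(self.channel_att(x))
```

Blocks like this can be stacked to form the 2D CNN base model, e.g. VACNNBlock(1, 32) applied to a batch of single-channel log-mel spectrograms of shape (N, 1, n_mels, frames).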



Introduction

Emotion recognition (ER) plays an important role in human-computer interaction (HCI) [1]. During the last few years, numerous approaches have been proposed using different modalities (e.g., speech, facial expressions, and gestures) [2,3,4,5]. Speech is a useful modality in HCI research because its strength, tremor, and rate vary with the speaker's emotional state. Speech emotion recognition (SER) addresses many ER challenges by enabling computers to identify human emotions from acoustic features. There are two key steps in SER: (i) extracting appropriate acoustic features from the speech signal of an utterance and (ii) identifying the emotional state conveyed by that signal. There are four primary types of acoustic features that can be extracted from speech signals [6]: (i) continuous features, such as pitch and energy; (ii) features related to aspects of voice quality (e.g., harsh, tense, and breathy); (iii) spectral features, such as linear predictive coefficients, mel-frequency cepstral coefficients (MFCCs), and log-frequency power coefficients; and (iv) Teager Energy Operator (TEO)-based features, such as the normalized TEO autocorrelation envelope area.
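
As a rough illustration, the snippet below shows how the first and third categories (continuous pitch/energy features and spectral MFCCs) might be extracted with librosa; the example audio, parameter values, and pooling into an utterance-level vector are assumptions for demonstration only.

```python
# Illustrative extraction of two of the feature categories above with librosa:
# continuous features (pitch, energy) and spectral features (MFCCs).
# Parameter values (fmin/fmax, n_mfcc) are assumptions, not the paper's settings.
import librosa
import numpy as np

# Placeholder audio; replace with an emotional-speech utterance (e.g., from RAVDESS).
y, sr = librosa.load(librosa.example("trumpet"))

# Continuous features: fundamental frequency (pitch) via pYIN and short-time energy (RMS).
f0, voiced_flag, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                                  fmax=librosa.note_to_hz("C7"))
energy = librosa.feature.rms(y=y)[0]

# Spectral features: 13 mel-frequency cepstral coefficients per frame.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Summarize frame-level features into a fixed-length utterance-level vector.
utterance_vector = np.concatenate([
    [np.nanmean(f0), np.nanstd(f0)],   # nan-aware: pYIN returns NaN in unvoiced frames
    [energy.mean(), energy.std()],
    mfcc.mean(axis=1), mfcc.std(axis=1),
])
print(utterance_vector.shape)  # (30,)
```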
