EMO-DB Dataset Research Articles

Speech emotion recognition (SER) is an exciting topic in the field of human-machine interaction. Several handcrafted features are used for SER, but determining these features is both difficult and time-consuming. Instead, the use of features generated by convolutional neural networks (CNNs) from spectrograms and Mel-spectrograms has gained momentum in recent years. Because these CNNs are widely employed in image applications, the audio signals must be represented as images in the best possible way. The spectrogram presents evenly spaced frequency components, which is not desirable when the spectral energy lies mostly at low frequencies. The Mel filter provides benefits, but several studies have shown that its performance is inferior to that of biologically inspired models. In addition, the high variance between features negatively affects its classification performance. In this study, log-power rate map features are suggested as an auditory model for the SER task, together with a threshold function that focuses on regions with high spectral energy. A rate map provides better resolution in the low-frequency region. In addition, smoothing reduces the variance between features, and focusing on spectral peaks reduces the effect of user-dependent features. The proposed approach was tested, independent of subject and gender, on the EMO-DB and EMOVO datasets, which are widely used in the literature. On the EMO-DB dataset, an increase of 2.42% was achieved, with a classification performance of 91.32%. On the EMOVO dataset, an increase of 4.95% was achieved, with a classification performance of 68.93%.
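The idea of log compression followed by a threshold that keeps only high-energy regions can be illustrated with a minimal sketch. The abstract does not specify the threshold function or the rate map computation, so everything here is an assumption: the function name `log_power_with_threshold`, the `keep_db` parameter, and the choice of flooring values relative to the global peak are hypothetical, and the toy array stands in for a real (channels x frames) power map.

```python
import numpy as np

def log_power_with_threshold(power_map, keep_db=30.0, eps=1e-10):
    """Log-compress a (channels x frames) power map, then floor every value
    that lies more than `keep_db` dB below the global peak.

    Assumption: a peak-relative floor is only one possible threshold
    function; the abstract does not say which one the authors use.
    """
    # Convert power to decibels; eps guards against log10(0).
    log_power = 10.0 * np.log10(np.maximum(power_map, eps))
    # Everything below this level is treated as low-energy background.
    floor = log_power.max() - keep_db
    return np.maximum(log_power, floor)

# Toy power map: 2 frequency channels x 2 time frames (illustrative values).
power_map = np.array([[1.0, 1e-1],
                      [1e-5, 1e-2]])
features = log_power_with_threshold(power_map, keep_db=30.0)
print(features)
```

In this toy case the weakest cell (50 dB below the peak) is raised to the 30 dB floor, while the other cells keep their log-power values, so the representation emphasizes the regions near spectral peaks.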

The role of automatic emotion recognition from speech is growing continuously because of the accepted importance of reacting to the emotional state of the user in human–computer interaction. Most state-of-the-art emotion recognition methods are based on turn- and frame-level analysis independent of phonetic transcription. Here, we are interested in a phoneme-based classification of the level of arousal in acted and spontaneous emotions. To start, we show that our previously published classification technique, which achieved strong results in the Interspeech 2009 Emotion Challenge, cannot provide sufficiently good classification in cross-corpora evaluation (a condition close to real-life applications). To demonstrate the robustness of our emotion classification techniques, we use cross-corpora evaluation for a simplified two-class problem, namely high and low arousal emotions. We use emotion classes on a phoneme level for classification. We build our speaker-independent emotion classifier with HMMs, using GMM-based production probabilities and MFCC features. This classifier performs equally well with a complete phoneme set and with a reduced set of indicative vowels (7 out of 39 phonemes in the German SAM-PA list). Afterwards, we compare the emotion classification performance of the technique used in the Emotion Challenge with phoneme-based classification within the same experimental setup. With phoneme-level emotion classes, we increase cross-corpora classification performance by about 3.15% absolute (4.69% relative) for models trained on acted emotions (EMO-DB dataset) and evaluated on spontaneous emotions (VAM dataset); under the reverse conditions (trained on VAM, tested on EMO-DB), we obtain a 15.43% absolute (23.20% relative) improvement.
We show that using phoneme-level emotion classes can improve classification performance even with comparatively low speech recognition performance obtained with scant a priori knowledge about the language, implemented as a zero-gram for word-level modeling and a bi-gram for phoneme-level modeling. Finally, we compare our results with state-of-the-art cross-corpora evaluations on the VAM database. For training our models, we use an almost 15 times smaller training set, consisting of 456 utterances (210 low and 246 high arousal emotions) instead of 6820 utterances (4685 high and 2135 low arousal emotions). We are nevertheless able to increase cross-corpora classification performance by about 2.25% absolute (3.22% relative), from UA=69.7% obtained by Zhang et al. to UA=71.95%.
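The distinction between absolute and relative improvement, and the training-set size ratio, can be checked with a short calculation. This is only a sketch using the rounded figures quoted in the abstract; the helper name `improvement` is hypothetical, and because the inputs are rounded, the relative gain computed here (roughly 3.2%) can differ in the last digit from the reported value.

```python
def improvement(baseline, new):
    """Return (absolute, relative) gain of `new` over `baseline`.

    Both inputs are percentages; the absolute gain is in percentage
    points and the relative gain is a percentage of the baseline.
    """
    absolute = new - baseline
    relative = 100.0 * absolute / baseline
    return absolute, relative

# Unweighted accuracy (UA) figures quoted in the abstract.
abs_gain, rel_gain = improvement(69.7, 71.95)
print(f"absolute gain: {abs_gain:.2f} points, relative gain: {rel_gain:.1f}%")

# Training-set sizes quoted in the abstract (utterances).
size_ratio = 6820 / 456
print(f"training set ratio: {size_ratio:.1f}x")
```

The absolute gain is the plain difference between the two UA values, while the relative gain divides that difference by the baseline, which is why the same improvement appears as two different percentages in the text.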

Articles published on EMO-DB Dataset

Multi-Layer Hybrid Fuzzy Classification Based on SVM and Improved PSO for Speech Emotion Recognition

The Impact of Attention Mechanisms on Speech Emotion Recognition

An Analysis of the Impact of Spectral Contrast Feature in Speech Emotion Recognition

Pseudo-colored rate map representation for speech emotion recognition

Modeling phonetic pattern variability in favor of the creation of robust emotion classifiers for real-life applications
