Abstract

In speech emotion recognition (SER), existing emotional datasets generally exhibit an imbalanced distribution of emotional samples. Moreover, different fragment areas within an utterance contribute unequally to SER. To address these two issues, this paper proposes a new SER method that combines a unified first-order attention network with data balance. The proposed method first applies a data-balance strategy to augment and balance the training data. Then, a pre-trained convolutional neural network (CNN) model (i.e., VGGish) is fine-tuned on the target emotional datasets to learn segment-level speech features from the extracted Log Mel-spectrograms. Next, a unified first-order attention mechanism, covering different feature-pooling strategies such as sum, min, max, mean, and standard deviation (std), is embedded into the output of a bi-directional long short-term memory (Bi-LSTM) network. It learns high-level discriminative segment-level features and simultaneously aggregates them into fixed-length utterance-level features for SER. Finally, based on the utterance-level features, the softmax layer of the Bi-LSTM network performs the final emotion classification. Extensive experiments on three public datasets, i.e., BAUM-1s, AFEW5.0, and CHEAVD2.0, demonstrate the advantage of the proposed method.
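
For illustration, the following is a minimal PyTorch sketch of the attention-based pooling step described above: a soft attention weight is computed for each Bi-LSTM segment output, and the weighted segment features are aggregated with sum, min, max, mean, and std statistics into a fixed-length utterance-level vector. The layer sizes, the 128-dimensional VGGish segment embeddings, and the exact attention formulation are assumptions made for the sketch, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class FirstOrderAttentionPooling(nn.Module):
    """Sketch of attention-based pooling over Bi-LSTM segment features.

    Builds a fixed-length utterance-level vector by concatenating
    sum / min / max / mean / std statistics of attention-weighted
    segment features. Illustrative reconstruction, not the paper's
    exact formulation.
    """
    def __init__(self, feat_dim, attn_dim=128):
        super().__init__()
        self.proj = nn.Linear(feat_dim, attn_dim)        # projection W
        self.score = nn.Linear(attn_dim, 1, bias=False)  # scoring vector w

    def forward(self, h):                                 # h: (batch, T, feat_dim)
        e = self.score(torch.tanh(self.proj(h)))          # (batch, T, 1)
        alpha = torch.softmax(e, dim=1)                    # attention weights over segments
        weighted = alpha * h                                # (batch, T, feat_dim)

        pooled_sum = weighted.sum(dim=1)
        pooled_min = weighted.min(dim=1).values
        pooled_max = weighted.max(dim=1).values
        pooled_mean = weighted.mean(dim=1)
        pooled_std = weighted.std(dim=1)

        # Concatenate the five statistics into one utterance-level feature.
        return torch.cat([pooled_sum, pooled_min, pooled_max,
                          pooled_mean, pooled_std], dim=-1)  # (batch, 5 * feat_dim)

# Usage: Bi-LSTM over (assumed) 128-d VGGish segment embeddings, then pooling.
bilstm = nn.LSTM(input_size=128, hidden_size=64,
                 batch_first=True, bidirectional=True)
pool = FirstOrderAttentionPooling(feat_dim=128)
segments = torch.randn(4, 10, 128)   # 4 utterances, 10 segments each
h, _ = bilstm(segments)              # (4, 10, 128)
utterance_vec = pool(h)              # (4, 640), fed to the softmax classifier
```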

Highlights

  • Automatic speech emotion recognition (SER) has drawn extensive attention in areas such as speech signal processing, pattern recognition, and affective computing

  • Inspired by the above-mentioned advantages of the attention mechanism and data balance, this paper proposes a new SER method that combines data balance with a unified first-order attention network covering sum, min, max, mean, and standard deviation pooling

  • The number of speech Mel-spectrogram segments for each emotion becomes similar, and the number of samples in minority classes increases to some extent (see the sketch below)
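
As a rough illustration of the data-balance step, the sketch below oversamples Mel-spectrogram segments of minority emotion classes until every class has roughly as many segments as the largest class. The helper name `balance_segments` and the use of plain duplication (rather than the paper's specific augmentation strategy) are assumptions.

```python
import random
from collections import defaultdict

def balance_segments(segments, labels, seed=0):
    """Oversample minority-class Mel-spectrogram segments so that each
    emotion class has roughly as many segments as the largest class.
    Illustrative only: the paper's balance strategy may additionally
    augment samples rather than simply duplicating them.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for seg, lab in zip(segments, labels):
        by_class[lab].append(seg)

    target = max(len(v) for v in by_class.values())
    balanced_segments, balanced_labels = [], []
    for lab, segs in by_class.items():
        # Randomly duplicate segments until the class reaches the target count.
        extra = [rng.choice(segs) for _ in range(target - len(segs))]
        for seg in segs + extra:
            balanced_segments.append(seg)
            balanced_labels.append(lab)
    return balanced_segments, balanced_labels
```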


Summary

Introduction

Automatic speech emotion recognition (SER) has drawn extensive attention in the areas of speech signal processing, pattern recognition, affective computing, and so on. This is because automatic SER can be used in human-computer interaction [1, 2], smart homes [3], smart healthcare [4], robots [5], real-time translation tools [6], etc. Most prior works focus on learning hand-crafted acoustic features, which are fed into conventional classifiers for the final emotion classification. Nwe et al. [11] adopted short-time log frequency power coefficients (LFPC) as spectral features and employed hidden Markov models (HMMs) as the classifier for SER. Schuller et al. [15]

