Abstract

Spontaneous speech emotion recognition is a new and challenging research topic. In this paper, we propose a new method for spontaneous speech emotion recognition based on binaural representations and deep convolutional neural networks (CNNs). The proposed method first employs multiple CNNs to learn deep segment-level binaural representations, such as Left-Right and Mid-Side channel pairs, from extracted image-like Mel-spectrograms. These CNNs are initialized from a CNN model pre-trained on image data and then fine-tuned on the target emotional speech datasets. Next, a new feature pooling strategy, called block-based temporal feature pooling, is proposed to aggregate the learned segment-level features into fixed-length utterance-level features. Based on the utterance-level features, a linear support vector machine (SVM) is adopted for emotion classification. Finally, a two-stage score-level fusion strategy is used to integrate the results obtained from the Left-Right and Mid-Side pairs. Extensive experiments on two challenging spontaneous emotional speech datasets, the AFEW5.0 and BAUM-1s databases, demonstrate the effectiveness of the proposed method.
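
As a rough illustration of the binaural front end described above, the sketch below derives the Left-Right and Mid-Side channel pairs from a stereo waveform and converts each channel into an image-like log-Mel-spectrogram. It assumes stereo input and the librosa library; the sampling rate, Mel-band count, and decibel scaling are illustrative choices, not the paper's exact settings.

```python
# Sketch: binaural (Left-Right / Mid-Side) log-Mel-spectrograms.
# Assumptions: stereo input and librosa; sr and n_mels are placeholders.
import numpy as np
import librosa

def binaural_mel_spectrograms(path, sr=16000, n_mels=64):
    """Return image-like log-Mel-spectrograms for the L, R, Mid, Side channels."""
    y, sr = librosa.load(path, sr=sr, mono=False)  # y: (2, n_samples) for stereo
    left, right = y[0], y[1]
    mid = 0.5 * (left + right)   # Mid: average of the two channels
    side = 0.5 * (left - right)  # Side: half the channel difference
    specs = {}
    for name, channel in (("L", left), ("R", right), ("M", mid), ("S", side)):
        mel = librosa.feature.melspectrogram(y=channel, sr=sr, n_mels=n_mels)
        specs[name] = librosa.power_to_db(mel, ref=np.max)  # dB scale, image-like
    return specs
```

The block-based temporal feature pooling step can be sketched in a similar spirit: the time-ordered segment-level CNN features are split into a fixed number of temporal blocks, a statistic is pooled within each block, and the per-block descriptors are concatenated into one fixed-length utterance-level vector. Mean pooling over four blocks is an assumption for illustration; the paper's exact pooling statistics may differ.

```python
# Sketch: block-based temporal pooling of segment-level CNN features.
# Assumption: mean pooling inside each of n_blocks equal temporal blocks.
import numpy as np

def block_temporal_pool(segment_feats, n_blocks=4):
    """segment_feats: (n_segments, feat_dim) time-ordered CNN features.
    Returns a fixed-length vector of size n_blocks * feat_dim
    (assumes n_segments >= n_blocks)."""
    blocks = np.array_split(segment_feats, n_blocks, axis=0)
    pooled = [block.mean(axis=0) for block in blocks]
    return np.concatenate(pooled)
```

The resulting fixed-length vectors can then be classified with a linear SVM (for example, scikit-learn's LinearSVC), matching the utterance-level classification stage described above.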

Highlights

  • Speech signals are one of the most natural means of expressing human emotion

  • Considering the rich temporal-spatial information of binaural representations, we propose a new speech emotion recognition (SER) method based on convolutional neural networks (CNNs) and binaural representations

  • To evaluate the performance of the proposed method on spontaneous SER tasks, we conduct experiments on two challenging spontaneous emotional speech datasets, i.e., the AFEW5.0 [33] and BAUM-1s [34] databases

Summary

INTRODUCTION

Speech signals are one of the most natural means of expressing human emotion. Speech emotion recognition (SER) has become an important and challenging task in the fields of signal processing, artificial intelligence, pattern recognition, etc., because of its potential applications in human-computer interaction [1]. In our recent work [29], we designed image-like spectrograms as inputs to deep CNNs such as AlexNet to learn high-level, segment-level feature representations for SER. Such learned deep spectrum features benefit from cross-media transfer learning, since they are obtained by fine-tuning deep CNNs that were pre-trained on image classification tasks.
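
As a minimal sketch of this cross-media transfer idea, the snippet below loads an ImageNet-pre-trained AlexNet from torchvision, replaces its final 1000-way classifier with an emotion head, and runs one fine-tuning step on a placeholder batch of 3-channel spectrogram images. The seven-class head, optimizer settings, and batch contents are illustrative assumptions rather than the paper's training recipe.

```python
# Sketch: fine-tuning an ImageNet-pre-trained AlexNet on spectrogram
# "images". num_emotions, lr, and the batch are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision import models

num_emotions = 7  # assumption: size of the target emotion label set

# Load AlexNet with ImageNet weights and swap its 1000-way classifier
# for an emotion head (cross-media transfer from image classification).
model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
model.classifier[6] = nn.Linear(model.classifier[6].in_features, num_emotions)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)

# One illustrative fine-tuning step on a placeholder batch of 3-channel,
# 224x224 Mel-spectrogram images with segment-level emotion labels.
segments = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, num_emotions, (8,))
optimizer.zero_grad()
loss = criterion(model(segments), labels)
loss.backward()
optimizer.step()
```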

LEARNING SEGMENT-LEVEL FEATURES WITH DEEP CNNs
BLOCK-BASED TEMPORAL FEATURE POOLING
TWO-STAGE SCORE-LEVEL FUSION
DATASETS
SETTINGS
Findings
CONCLUSION AND FUTURE WORK