Abstract

With the growing demand for automatic emotion recognition systems, emotion recognition is becoming increasingly important in human–computer interaction (HCI) research. Recently, the performance of automatic emotion recognition has improved steadily, driven by advances in both hardware and deep learning methods. However, because emotion is an abstract concept with many forms of expression, automatic emotion recognition remains a challenging task. In this paper, we propose a novel Multi-modal Correlated Network for emotion recognition that exploits information from both the audio and visual channels to achieve more robust and accurate detection. In the proposed method, the audio and visual signals are first preprocessed for feature extraction: from the audio we obtain Mel-spectrograms, which can be treated as images, and from the visual segments we extract representative frames. The Mel-spectrograms are then fed to a convolutional neural network (CNN) to obtain the audio features, and the representative frames are fed to a CNN followed by an LSTM to obtain the visual features. In particular, we employ a triplet loss to increase inter-class separation, and we propose a novel correlated loss to reduce intra-class variation. Finally, we apply feature fusion to combine the audio and visual features for emotion classification. Experimental results on the AFEW dataset demonstrate that the correlation information between modalities is crucial for automatic emotion recognition and that the proposed method achieves state-of-the-art performance on the classification task.
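To make the described pipeline concrete, the following is a minimal PyTorch sketch of an audio-visual network of this kind, not the authors' implementation. It assumes 1-channel Mel-spectrogram inputs, a fixed number of representative RGB frames per clip, arbitrary feature dimensions, randomly chosen triplet pairs, and a simple cosine-based stand-in for the correlated loss, whose exact formulation is not given in the abstract.

import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioCNN(nn.Module):
    """CNN over the Mel-spectrogram 'image' -> fixed-length audio feature."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, feat_dim)

    def forward(self, x):          # x: (B, 1, n_mels, time)
        return self.fc(self.conv(x).flatten(1))


class VisualCNNLSTM(nn.Module):
    """Per-frame CNN followed by an LSTM over the representative frames."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.lstm = nn.LSTM(64, feat_dim, batch_first=True)

    def forward(self, x):          # x: (B, T, 3, H, W)
        b, t = x.shape[:2]
        frame_feats = self.cnn(x.flatten(0, 1)).flatten(1).view(b, t, -1)
        _, (h, _) = self.lstm(frame_feats)
        return h[-1]               # last hidden state as the visual feature


class CorrelatedNet(nn.Module):
    """Fuses audio and visual features and classifies the emotion."""
    def __init__(self, feat_dim=128, num_classes=7):
        super().__init__()
        self.audio = AudioCNN(feat_dim)
        self.visual = VisualCNNLSTM(feat_dim)
        self.classifier = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, mel, frames):
        fa, fv = self.audio(mel), self.visual(frames)
        return self.classifier(torch.cat([fa, fv], dim=1)), fa, fv


def correlated_loss(fa, fv):
    """Stand-in for the paper's correlated loss: pull the two modalities'
    features of the same sample together (1 - cosine similarity)."""
    return (1.0 - F.cosine_similarity(fa, fv, dim=1)).mean()


# Toy training step on random tensors.
model = CorrelatedNet()
triplet = nn.TripletMarginLoss(margin=1.0)
mel = torch.randn(4, 1, 64, 300)          # batch of Mel-spectrograms
frames = torch.randn(4, 16, 3, 112, 112)  # 16 representative frames per clip
labels = torch.randint(0, 7, (4,))

logits, fa, fv = model(mel, frames)
# Anchor/positive/negative would normally be mined by label; random here.
loss = (F.cross_entropy(logits, labels)
        + triplet(fa, fa.roll(1, 0), fa.roll(2, 0))
        + correlated_loss(fa, fv))
loss.backward()

In a real setup, the triplet loss would use label-aware anchor/positive/negative mining on the learned features, and the fused representation could be replaced by any other fusion scheme without changing the overall structure of the sketch.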
