Multiple Spatio-temporal Feature Learning for Video-based Emotion Recognition in the Wild

Cheng Lu, Yuan Zong, Wenming Zheng, Suyuan Liu, Chuangao Tang, Chaolong Li, Simeng Yan

doi:10.1145/3242969.3264992

Abstract

The central difficulty of emotion recognition in the wild (EmotiW) is training a model that is robust to diverse scenarios and anomalies. The Audio-video Sub-challenge of EmotiW provides short audio-video clips annotated with several emotion labels, and the task is to determine which label each video belongs to. To improve emotion recognition in videos, we propose a multiple spatio-temporal feature fusion (MSFF) framework, which depicts emotional information more accurately in the spatial and temporal dimensions by drawing on two mutually complementary sources: facial images and audio. The framework consists of two parts: a facial image model and an audio model. In the facial image model, three different spatio-temporal neural network architectures are employed to extract features from facial expression images that discriminate between emotions. First, high-level spatial features are obtained from pre-trained convolutional neural networks (CNNs), namely VGG-Face and ResNet-50, both fed with the images generated from each video. Then, the features of all frames are input sequentially to a Bi-directional Long Short-Term Memory (BLSTM) network to capture dynamic variations of facial appearance textures in a video. In addition to this CNN-RNN structure, another spatio-temporal network, a deep 3-Dimensional Convolutional Neural Network (3D CNN) that extends the 2D convolution kernel to 3D, is applied to capture the evolving emotional information encoded in multiple adjacent frames. In the audio model, spectrogram images of speech, generated by preprocessing the audio, are likewise modeled in a VGG-BLSTM framework to characterize affective fluctuation more efficiently. Finally, a fusion strategy that combines the score matrices produced by the different spatio-temporal networks is proposed, so that the networks complement one another and boost emotion recognition performance. Extensive experiments show that the proposed MSFF achieves an overall accuracy of 60.64%, a large improvement over the baseline that also outperforms the result of the champion team in 2017.
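
The score-level fusion step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the score matrices are combined, so the weighted sum below is an assumption chosen for illustration, not the authors' method. The model names, weights, and scores are all hypothetical toy values, and three classes are used only to keep the example small.

```python
from typing import Dict, List

ScoreMatrix = List[List[float]]  # rows = clips, columns = emotion classes


def fuse_scores(score_matrices: Dict[str, ScoreMatrix],
                weights: Dict[str, float]) -> ScoreMatrix:
    """Combine per-model score matrices with a weighted sum (assumed rule)."""
    if not score_matrices:
        raise ValueError("at least one score matrix is required")
    first = next(iter(score_matrices.values()))
    n_clips, n_classes = len(first), len(first[0])
    fused = [[0.0] * n_classes for _ in range(n_clips)]
    for name, matrix in score_matrices.items():
        # Every model must score the same clips over the same classes.
        if len(matrix) != n_clips or any(len(r) != n_classes for r in matrix):
            raise ValueError(f"shape mismatch for model {name!r}")
        w = weights[name]
        for i, row in enumerate(matrix):
            for j, score in enumerate(row):
                fused[i][j] += w * score
    return fused


def predict(fused: ScoreMatrix) -> List[int]:
    """Pick the highest-scoring class index for each clip."""
    return [max(range(len(row)), key=row.__getitem__) for row in fused]


# Hypothetical scores for two clips and three classes from three models.
scores = {
    "vgg_blstm": [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]],
    "c3d":       [[0.2, 0.5, 0.3], [0.1, 0.2, 0.7]],
    "audio":     [[0.5, 0.4, 0.1], [0.3, 0.3, 0.4]],
}
# Hypothetical fusion weights; in practice these would be tuned on validation data.
weights = {"vgg_blstm": 0.5, "c3d": 0.3, "audio": 0.2}

fused = fuse_scores(scores, weights)
labels = predict(fused)
print(labels)
```

In this toy example the second clip is assigned a class that the VGG-BLSTM scores alone would not have chosen, which is the kind of complementary effect that fusing several networks is meant to exploit.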
