Combining modality specific deep neural networks for emotion recognition in video

Samira Ebrahimi Kahou, David Warde-Farley, Emmanuel Bengio, Kishore Reddy Konda, Aaron Courville, Raul Chandias Ferrari, Roland Memisevic, Nicolas Boulanger-Lewandowski, Pascal Vincent, Atousa Torabi, Jean-Philippe Raymond, Pascal Lamblin, Myriam Côté, Arjun Sharma, Christopher Pal, Zhenzhou Wu, Guillaume Desjardins, Xavier Bouthillier, Pierre-Luc Carrier, Mehdi Mirza, Razvan Pascanu, Sébastien Jean, Pierre Froumenty, Yoshua Bengio, Çağlar Gülçehre, Jérémie Zumer, Ankur Aggarwal, Yann N Dauphin

doi:10.1145/2522848.2531745

Abstract

In this paper we present the techniques used for the University of Montreal's team submissions to the 2013 Emotion Recognition in the Wild Challenge. The challenge is to classify the emotions expressed by the primary human subject in short video clips extracted from feature-length movies. This involves the analysis of video clips of acted scenes lasting approximately one to two seconds, including the audio track, which may contain human voices as well as background music. Our approach combines multiple deep neural networks for different data modalities, including: (1) a deep convolutional neural network for the analysis of facial expressions within video frames; (2) a deep belief net to capture audio information; (3) a deep autoencoder to model the spatio-temporal information produced by the human actions depicted within the entire scene; and (4) a shallow network architecture focused on features extracted from the mouth of the primary human subject in the scene. We discuss each of these techniques, their performance characteristics, and different strategies to aggregate their predictions. Our best single model was a convolutional neural network trained to predict emotions from static frames using two large data sets, the Toronto Face Database and our own set of face images harvested from Google image search, followed by a per-frame aggregation strategy that used the challenge training data. This yielded a test set accuracy of 35.58%. Using our best strategy for aggregating our top-performing models into a single predictor, we were able to produce an accuracy of 41.03% on the challenge test set. These results compare favorably to the challenge baseline test set accuracy of 27.56%.
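
The abstract mentions two aggregation steps: per-frame predictions are pooled into one prediction per clip, and several modality-specific models are merged into a single predictor. The abstract does not say how either step is done, so the following is only a minimal sketch under assumed choices: simple averaging over frames and a fixed weighted average over models. The function names, the three-class toy data, and the weights are all hypothetical and are not taken from the paper.

```python
def aggregate_frames(frame_probs):
    """Average per-frame class probabilities into one clip-level distribution.

    Assumption: plain averaging stands in for the paper's per-frame
    aggregation strategy, which the abstract does not describe.
    """
    n_frames = len(frame_probs)
    return [sum(column) / n_frames for column in zip(*frame_probs)]


def combine_models(clip_probs, weights):
    """Weighted average of clip-level distributions from several models.

    Assumption: fixed hand-picked weights; the paper's actual
    combination strategy is not given in the abstract.
    """
    total = sum(weights[name] for name in clip_probs)
    n_classes = len(next(iter(clip_probs.values())))
    combined = [0.0] * n_classes
    for name, probs in clip_probs.items():
        w = weights[name] / total  # normalise so the weights sum to 1
        for k, p in enumerate(probs):
            combined[k] += w * p
    return combined


# Toy data: 3 frames, 3 hypothetical emotion classes.
frames = [
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.8, 0.1],
]
convnet_clip = aggregate_frames(frames)   # clip-level output of the frame model
audio_clip = [0.5, 0.2, 0.3]              # hypothetical audio-model output

combined = combine_models(
    {"convnet": convnet_clip, "audio": audio_clip},
    {"convnet": 0.75, "audio": 0.25},
)
predicted_class = max(range(len(combined)), key=combined.__getitem__)
```

Normalising the weights keeps the combined vector a valid probability distribution, so the final label is simply the index of its largest entry.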

Similar Papers

Recurrent Neural Networks for Emotion Recognition in Video
Samira Ebrahimi Kahou ... Christopher Pal
09 Nov 2015

Long short term memory recurrent neural network based encoding method for emotion recognition in video
Linlin Chao ... Jianhua Tao
01 Mar 2016

Multiple Spatio-temporal Feature Learning for Video-based Emotion Recognition in the Wild
Cheng Lu ... Simeng Yan
02 Oct 2018

Multi-user facial emotion recognition in video based on user-dependent neural network adaptation
Egor Churaev ... Andrey V Savchenko
23 May 2022
