Abstract

As emotions play a central role in human communication, automatic emotion recognition has attracted increasing attention over the last two decades. While multimodal systems achieve high performance on lab-controlled data, they are still far from ecological validity on non-lab-controlled, namely "in-the-wild", data. This work investigates audiovisual deep learning approaches to the in-the-wild emotion recognition problem. Inspired by the strong performance of end-to-end and transfer learning techniques, we explored the effectiveness of architectures in which a modality-specific Convolutional Neural Network (CNN) is followed by a Long Short-Term Memory Recurrent Neural Network (LSTM-RNN), using the AffWild2 dataset under the Affective Behavior Analysis in-the-Wild (ABAW) challenge protocol. We deployed unimodal end-to-end and transfer learning approaches within a multimodal fusion system, which generated final predictions via a weighted score fusion scheme. With the proposed deep-learning-based multimodal system, we reached a test set challenge performance measure of 48.1% on the ABAW 2020 Facial Expressions challenge, improving on the first-runner-up performance.
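
As a minimal, illustrative sketch (not the authors' released code), the pipeline above can be expressed in PyTorch as a modality-specific CNN that embeds each frame, an LSTM that models the temporal dynamics of those embeddings, and a weighted score fusion of the per-modality class scores. All layer sizes, the fusion weights, and the input shapes below are hypothetical placeholders.

    import torch
    import torch.nn as nn

    class CNNLSTMBranch(nn.Module):
        """One modality branch: a CNN embeds each frame, an LSTM models time.
        The tiny CNN here is a stand-in for the pretrained/end-to-end backbones."""
        def __init__(self, in_channels: int, embed_dim: int = 256,
                     hidden_dim: int = 128, num_classes: int = 7):
            super().__init__()
            self.cnn = nn.Sequential(
                nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(4), nn.Flatten(),
                nn.Linear(32 * 4 * 4, embed_dim),
            )
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.head = nn.Linear(hidden_dim, num_classes)

        def forward(self, clip: torch.Tensor) -> torch.Tensor:
            # clip: (batch, time, channels, height, width)
            b, t = clip.shape[:2]
            feats = self.cnn(clip.flatten(0, 1)).view(b, t, -1)  # per-frame embeddings
            out, _ = self.lstm(feats)
            return self.head(out[:, -1])  # class scores from the last time step

    def weighted_score_fusion(scores, weights):
        """Late fusion: convex combination of per-modality class posteriors."""
        fused = sum(w * s.softmax(dim=-1) for w, s in zip(weights, scores))
        return fused.argmax(dim=-1)

    # Hypothetical usage: a face-video branch and an audio-spectrogram branch.
    video_branch = CNNLSTMBranch(in_channels=3)
    audio_branch = CNNLSTMBranch(in_channels=1)
    video_clip = torch.randn(2, 16, 3, 64, 64)  # (batch, time, C, H, W)
    audio_spec = torch.randn(2, 16, 1, 64, 64)  # spectrogram chunks per frame
    preds = weighted_score_fusion(
        [video_branch(video_clip), audio_branch(audio_spec)],
        weights=[0.6, 0.4],  # illustrative; would be tuned on a validation set
    )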

Highlights

  • Emotions play a vital role in daily human–human interactions [1]

  • Different techniques can be used for temporal aggregation; we focus on Support Vector Machines (SVMs) and LSTMs applied to Convolutional Neural Network (CNN) embeddings (see the sketch after this list)

  • This article investigates the efficacy of deep learning models for in-the-wild audiovisual emotion recognition
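
To make the temporal-aggregation comparison in the second highlight concrete, the following sketch (assuming scikit-learn and randomly generated stand-ins for per-frame CNN embeddings) shows the SVM route, which collapses the time axis by mean pooling before classification. The LSTM route instead consumes the full embedding sequence, as in the CNN+LSTM sketch above.

    import numpy as np
    from sklearn.svm import SVC

    # Hypothetical precomputed per-frame CNN embeddings for 100 clips:
    # X has shape (num_clips, num_frames, embed_dim); y holds clip-level labels.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 16, 256))
    y = rng.integers(0, 7, size=100)  # seven basic-emotion classes

    # SVM route: mean-pool over time, then classify the pooled embedding.
    X_pooled = X.mean(axis=1)  # (num_clips, embed_dim)
    svm = SVC(kernel="linear").fit(X_pooled, y)
    print(svm.predict(X_pooled[:5]))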

Introduction

Automated recognition of emotions from multimodal signals has attracted increasing attention over the last two decades, with applications in domains ranging from intelligent call centers [2,3] to intelligent tutoring systems [4,5,6]. Emotion recognition is studied within the broader field of affective computing, where the study of natural emotions is the focal point. Research in this domain is shifting to "in-the-wild" conditions, namely away from lab-controlled studies. This shift is driven by the availability of new and challenging datasets collected and introduced in competitions such as Acted Facial Expressions in-the-Wild (AFEW) [7,8] and Affective Behavior Analysis in-the-Wild (ABAW). Considering the challenging nature of the data, e.g., background noise in audio, and cluttered backgrounds and pose variations in video, benefiting from multiple modalities, including, but not limited to, acoustics, vision (face and body pose), physiological signals, and linguistics, is essential [14].

