Joint modelling of audio-visual cues using attention mechanisms for emotion recognition

Esam Ghaleb, Stylianos Asteriadis, Jan Niehues

doi:10.1007/s11042-022-13557-w

Open Access

https://doi.org/10.1007/s11042-022-13557-w

Journal: Multimedia Tools and Applications	Publication Date: Aug 5, 2022
Citations: 10	License type: open-access

Affiliation: Maastricht University

Abstract

Emotions play a crucial role in human-human communication, which has a complex socio-psychological nature. To enhance emotion communication in human-computer interaction, this paper studies emotion recognition from audio and visual signals in video clips, utilizing facial expressions and vocal utterances. The study aims to exploit the temporal information of audio-visual cues and detect their informative time segments. Attention mechanisms are used to exploit the importance of each modality over time. We propose a novel framework that consists of bi-modal time windows spanning short video clips labeled with discrete emotions. The framework employs two networks, each dedicated to one modality. As input to a modality-specific network, we consider a time-dependent signal derived from the embeddings of the video and audio modalities. We apply one Transformer encoder to the visual embeddings and another to the audio embeddings. The paper presents detailed studies and meta-analysis findings that link the outputs of the proposed framework to research from psychology. Specifically, it presents a framework for understanding the underlying principles of emotion recognition as functions of three separate modality setups: audio only, video only, and the fusion of audio and video. Experimental results on two datasets show that the proposed framework achieves improved accuracy in emotion recognition compared to state-of-the-art techniques and baseline methods that do not use attention mechanisms. The proposed method improves the results over baseline methods by at least 5.4%. Our experiments show that attention mechanisms reduce the gap between the entropies of unimodal predictions, which increases the certainty of the bimodal predictions and, therefore, improves the bimodal recognition rates.
Furthermore, we evaluate the framework with noisy data introduced in different scenarios during training and testing to check its consistency and the behavior of the attention mechanism. The results demonstrate that attention mechanisms increase the framework’s robustness when it is exposed to similar conditions during the training and testing phases. Finally, we present comprehensive evaluations of emotion recognition as a function of time. The study shows that the middle time segments of a video clip are essential for the audio modality. For the video modality, however, importance is distributed equally across time windows.
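
The entropy argument in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the logits are made up, the four-class setup is arbitrary, and averaging the two unimodal distributions (`average_fusion`) is only an assumed stand-in for the paper's fusion step, which the abstract does not specify. The sketch only shows how the entropy of each unimodal prediction and the gap between them could be computed.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def average_fusion(p_audio, p_video):
    """Hypothetical late fusion: average the two unimodal distributions."""
    return [(a + v) / 2.0 for a, v in zip(p_audio, p_video)]


# Made-up logits over four emotion classes (illustrative values only).
audio_probs = softmax([2.0, 0.5, 0.1, 0.1])  # fairly confident prediction
video_probs = softmax([0.4, 0.3, 0.2, 0.1])  # nearly uniform prediction

fused = average_fusion(audio_probs, video_probs)

# Gap between the unimodal entropies, the quantity the abstract refers to.
entropy_gap = abs(entropy(audio_probs) - entropy(video_probs))

print("audio entropy:", entropy(audio_probs))
print("video entropy:", entropy(video_probs))
print("entropy gap:  ", entropy_gap)
print("fused entropy:", entropy(fused))
```

A lower entropy means a more certain prediction. In this toy case the audio prediction is more certain than the video prediction, so the gap is large. The abstract's claim is that attention mechanisms shrink this gap in the authors' experiments.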

Similar Papers

Multimodal speaker/speech recognition using lip motion, lip texture and audio
H.E Çetingül ... A.M Tekalp
Signal Processing | VOL. 86
02 Jun 2006

Attitudes and Folk Theories of Data Subjects on Transparency and Accuracy in Emotion Recognition
Gabriel Grill ... Nazanin Andalibi
Proceedings of the ACM on Human-Computer Interaction | VOL. 6
30 Mar 2022

Vocal emotion recognition in attention-deficit hyperactivity disorder: a meta-analysis
Rohanna C Sells ... Georgia Chronaki
Cognition and Emotion | VOL. ahead-of-print
14 Sep 2023

Expression EEG Multimodal Emotion Recognition Method Based on the Bidirectional LSTM and Attention Mechanism.
Yifeng Zhao ... Deyun Chen
Computational and Mathematical Methods in Medicine | VOL. 2021
11 May 2021
