AVaTER: Fusing Audio, Visual, and Textual Modalities Using Cross-Modal Attention for Emotion Recognition.

Avishek Das, Moumita Sen Sarma, Mohammed Moshiul Hoque, Nazmul Siddique, M Ali Akber Dewan

doi:10.3390/s24185862

Abstract

Multimodal emotion classification (MEC) identifies human emotions by integrating data from multiple sources, such as audio, video, and text. Combining modalities exploits their complementary strengths and improves the accuracy and robustness of emotion recognition systems. One significant challenge, however, is integrating these diverse sources effectively, since each has its own characteristics and noise level. In addition, the scarcity of large, annotated multimodal datasets in Bangla limits the training and evaluation of models. In this work, we introduce a multimodal Bangla dataset, MAViT-Bangla (Multimodal Audio Video Text Bangla dataset). The dataset comprises 1002 samples across the audio, video, and text modalities and covers emotional categories such as anger, fear, joy, and sadness, making it a resource for emotion recognition research in the Bangla language. We also developed a framework for audio, video, and textual emotion recognition (AVaTER) that applies a cross-modal attention mechanism among unimodal features. This mechanism lets features from different modalities interact and fuse, which improves the model's ability to capture nuanced emotional cues. The approach achieved an F1-score of 0.64, a significant improvement over unimodal methods.
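To make the idea of cross-modal attention among unimodal features concrete, the following is a minimal sketch in plain Python, not the AVaTER implementation. The abstract does not specify the feature extractors, dimensions, or fusion layout, so everything here is an assumption for illustration: the function names (`cross_modal_attention`, `fuse`), the toy feature values, the use of scaled dot-product attention without learned projections, and the choice to mean-pool and concatenate the attended features.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(queries, keys_values):
    """Let each vector of one modality attend over the vectors of another.

    Assumption: both modalities were already projected to the same dimension.
    A trained model would normally add learned query/key/value projections.
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        # Scaled dot-product score between this query and every key.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        # Weighted sum of the other modality's vectors.
        context = [
            sum(w * v[j] for w, v in zip(weights, keys_values))
            for j in range(dim)
        ]
        attended.append(context)
    return attended


def mean_pool(vectors):
    """Average a sequence of vectors into a single vector."""
    count = len(vectors)
    return [sum(v[j] for v in vectors) / count for j in range(len(vectors[0]))]


def fuse(audio, visual, text):
    """Hypothetical fusion: every modality queries each of the other two,
    and the pooled results are concatenated into one feature vector."""
    modalities = {"audio": audio, "visual": visual, "text": text}
    fused = []
    for query_name, query_feats in modalities.items():
        for source_name, source_feats in modalities.items():
            if query_name != source_name:
                attended = cross_modal_attention(query_feats, source_feats)
                fused.extend(mean_pool(attended))
    return fused


# Toy unimodal features (made-up numbers): 2 audio frames, 3 video frames,
# 4 text tokens, each already embedded in 4 dimensions.
audio = [[0.1, 0.3, 0.5, 0.2], [0.4, 0.1, 0.0, 0.9]]
visual = [[0.2, 0.2, 0.1, 0.7], [0.9, 0.0, 0.3, 0.1], [0.5, 0.5, 0.5, 0.5]]
text = [[0.0, 1.0, 0.0, 0.3], [0.6, 0.2, 0.8, 0.1],
        [0.3, 0.3, 0.3, 0.3], [0.7, 0.1, 0.2, 0.4]]

fused = fuse(audio, visual, text)
print(len(fused))  # 6 ordered modality pairs x 4 dimensions = 24
```

In a full system, the fused vector would feed a classifier over the emotion categories. How AVaTER pools, projects, and classifies is not stated in the abstract, so those steps are left out rather than guessed.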

Journal: Sensors (Basel, Switzerland)
Publication Date: Sep 10, 2024
License type: CC BY 4.0

Similar Papers

Multimodal emotion recognition based on audio and text by using hybrid attention networks
Shiqing Zhang ... Xiaoming Zhao
Biomedical Signal Processing and Control | VOL. 85
30 May 2023

Advancing Fine-Grained Emotion Recognition in Short Text
01 Jan 2015

A survey of state-of-the-art approaches for emotion recognition in text
Nourah Alswaidan ... Mohamed El Bachir Menai
Knowledge and Information Systems | VOL. 62
18 Mar 2020

Recognition of Emotions of Speech and Mood of Music: A Review
Gaurav Agarwal ... Sushila Maheshkar
01 Jan 2018
