Multimodal Emotion Recognition Fusion Analysis Adapting BERT With Heterogeneous Feature Unification

Sanghyun Lee, Hanseok Ko, David K Han

doi:10.1109/access.2021.3092735

Abstract

Human communication includes rich emotional content, so multimodal emotion recognition plays an important role in communication between humans and computers. Because of the complex emotional characteristics of a speaker, emotion recognition remains a challenge, particularly in capturing emotional cues across a variety of modalities, such as speech, facial expressions, and language. Audio and visual cues are particularly vital for a human observer in understanding emotions. However, most previous work on emotion recognition has been based solely on linguistic information, which can overlook various forms of nonverbal information. In this paper, we present a new multimodal emotion recognition approach that improves the BERT model for emotion recognition by combining it with heterogeneous features based on language, audio, and visual modalities. Specifically, we extend the BERT model to handle the heterogeneous features of the audio and visual modalities. We introduce the Self-Multi-Attention Fusion module, the Multi-Attention Fusion module, and the Video Fusion module, which are attention-based multimodal fusion mechanisms built on the recently proposed transformer architecture. We explore the optimal ways to combine fine-grained representations of audio and visual features into a common embedding while combining a pre-trained BERT model with the other modalities for fine-tuning. In our experiments, we evaluate our method on the commonly used CMU-MOSI, CMU-MOSEI, and IEMOCAP datasets for multimodal sentiment analysis. Ablation analysis indicates that the audio and visual components make a significant contribution to the recognition results, suggesting that these modalities contain highly complementary information for sentiment analysis based on video input. Our method achieves state-of-the-art performance on the CMU-MOSI, CMU-MOSEI, and IEMOCAP datasets.

Highlights

Effective communication among humans requires not only intellectual exchange but also the sharing of contextual emotions
We describe a transformer-based process that can effectively fuse heterogeneous audio and image feature information
CMU-MOSI consists of 2,199 short monologue video clips, taken from YouTube movie reviews, for multimodal sentiment and emotion recognition

Summary

INTRODUCTION

Effective communication among humans requires not only intellectual exchange but also the sharing of contextual emotions. For deep learning based emotion recognition, the studies in [13]–[16] used CNNs to extract facial features salient to expressed emotions. Another important feature for classifying emotions is the textual content of speech. These unimodal feature based methods were found to be insufficient for inferring the speaker’s sentiment, because many salient emotional features are expressed simultaneously via different modalities [31]. The proposed model, Heterogeneous Features Unification (HFU-BERT), integrates BERT into our architecture to effectively combine heterogeneous features extracted by both handcrafted and deep learning based methods.
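The abstract describes attention-based fusion of text with audio and visual features. The snippet below is only a minimal sketch of that general idea, not the authors' Self-Multi-Attention Fusion, Multi-Attention Fusion, or Video Fusion modules. It assumes a single attention head, no learned projections, and modalities that already share one embedding size; the function names `softmax` and `cross_modal_attention` and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(text_vecs, other_vecs):
    """Let each text vector (query) attend over another modality (keys and values).

    Assumption: both modalities have the same dimensionality and no learned
    projection matrices are applied, unlike a real transformer layer.
    """
    dim = len(text_vecs[0])
    fused = []
    for query in text_vecs:
        # Scaled dot-product score between the text query and each frame.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in other_vecs
        ]
        weights = softmax(scores)
        # Attention-weighted average of the other modality's vectors.
        context = [
            sum(w * value[i] for w, value in zip(weights, other_vecs))
            for i in range(dim)
        ]
        # Residual-style fusion: add the attended context to the text vector.
        fused.append([q + c for q, c in zip(query, context)])
    return fused


# Toy data (hypothetical): two text token embeddings, three audio frames.
text = [[1.0, 0.0], [0.0, 1.0]]
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = cross_modal_attention(text, audio)
```

Each output row keeps the text representation and adds a weighted summary of the audio frames most similar to it, so the fused sequence has the same length as the text sequence and could, under these assumptions, be passed on to a text encoder.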

RELATED WORK

VISUAL FEATURES

TEXT PREPROCESSING

MULTI-ATTENTION FUSION

EXPERIMENTS

RESULTS AND DISCUSSION

CONCLUSION

Journal: IEEE Access
Publication Date: Jan 1, 2021
Citations: 33
License type: CC BY 4.0