Abstract

Recent Transformer-based contextual word representations, including BERT and XLNet, have shown state-of-the-art performance in multiple disciplines within NLP. Fine-tuning the trained contextual models on task-specific datasets has been the key to achieving superior performance downstream. While fine-tuning these pre-trained models is straightforward for lexical applications (applications with only the language modality), it is not trivial for multimodal language (a growing area in NLP focused on modeling face-to-face communication). Pre-trained models do not have the necessary components to accept the two additional modalities of vision and acoustic. In this paper, we propose an attachment to BERT and XLNet called the Multimodal Adaptation Gate (MAG). MAG allows BERT and XLNet to accept multimodal nonverbal data during fine-tuning. It does so by generating a shift to the internal representation of BERT and XLNet, a shift that is conditioned on the visual and acoustic modalities. In our experiments, we study the commonly used CMU-MOSI and CMU-MOSEI datasets for multimodal sentiment analysis. Fine-tuning MAG-BERT and MAG-XLNet significantly boosts sentiment analysis performance over previous baselines as well as over language-only fine-tuning of BERT and XLNet. On the CMU-MOSI dataset, MAG-XLNet achieves human-level multimodal sentiment analysis performance for the first time in the NLP community.
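
To make the mechanism described above concrete, the PyTorch sketch below illustrates one way such an adaptation gate could be implemented: a displacement vector is computed from the visual and acoustic features, gated by the text representation, and added to the Transformer hidden state with a bounded magnitude. The layer sizes, the ReLU gating, the `beta` scaling rule, and the LayerNorm/dropout placement are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn


class MultimodalAdaptationGate(nn.Module):
    """Sketch of a Multimodal Adaptation Gate (MAG).

    Takes a text hidden state z from BERT/XLNet plus word-aligned visual (v)
    and acoustic (a) features, and returns a shifted text representation.
    All tensors are shaped (batch, seq_len, feature_dim).
    """

    def __init__(self, text_dim, visual_dim, acoustic_dim, beta=1.0, dropout=0.1):
        super().__init__()
        # Gates conditioned jointly on the text and the nonverbal modality.
        self.gate_v = nn.Linear(text_dim + visual_dim, text_dim)
        self.gate_a = nn.Linear(text_dim + acoustic_dim, text_dim)
        # Projections of the nonverbal features into the text space.
        self.proj_v = nn.Linear(visual_dim, text_dim)
        self.proj_a = nn.Linear(acoustic_dim, text_dim)
        self.beta = beta  # assumed hyperparameter bounding the shift magnitude
        self.norm = nn.LayerNorm(text_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, z, v, a):
        # Modality-specific gates.
        g_v = torch.relu(self.gate_v(torch.cat([z, v], dim=-1)))
        g_a = torch.relu(self.gate_a(torch.cat([z, a], dim=-1)))

        # Displacement (shift) conditioned on the visual and acoustic modalities.
        h = g_v * self.proj_v(v) + g_a * self.proj_a(a)

        # Scale the shift so it never dominates the original text representation.
        eps = 1e-6
        scale = torch.clamp(
            self.beta * z.norm(dim=-1, keepdim=True) / (h.norm(dim=-1, keepdim=True) + eps),
            max=1.0,
        )
        return self.dropout(self.norm(z + scale * h))
```

In use, a module like this would sit between two encoder layers of the pre-trained model, consuming that layer's hidden states together with the aligned nonverbal features and passing the shifted representation to the next layer, so the rest of BERT or XLNet is fine-tuned unchanged.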

Highlights

  • Human face-to-face communication flows as a seamless integration of language, acoustic, and vision modalities

  • Fine-tuning MAG-BERT and MAG-XLNet significantly boosts sentiment analysis performance over previous multimodal baselines and over language-only fine-tuning of BERT and XLNet

  • This shows that the MAG component allows the BERT model to adapt to multimodal information during fine-tuning, achieving superior performance


Summary

Introduction

Human face-to-face communication flows as a seamless integration of language, acoustic, and vision modalities; we draw on all three to convey our intentions and emotions. Understanding this face-to-face communication falls within an increasingly growing NLP research area called multimodal language analysis (Zadeh et al., 2018b). The biggest challenge in this area is to efficiently model the three pillars of communication together. This gives artificial intelligence systems the capability to comprehend multi-sensory information without disregarding nonverbal factors. In many applications, such as dialogue systems and virtual reality, this capability is crucial to maintaining a high quality of user interaction.


