Abstract

Human language is often multimodal, comprising a mixture of natural language, facial gestures, and acoustic behaviors. However, two major challenges exist in modeling such multimodal human language time-series data: 1) inherent data non-alignment due to variable sampling rates for the sequences from each modality; and 2) long-range dependencies between elements across modalities. In this paper, we introduce the Multimodal Transformer (MulT) to generically address the above issues in an end-to-end manner without explicitly aligning the data. At the heart of our model is the directional pairwise crossmodal attention, which attends to interactions between multimodal sequences across distinct time steps and latently adapts streams from one modality to another. Comprehensive experiments on both aligned and non-aligned multimodal time-series show that our model outperforms state-of-the-art methods by a large margin. In addition, empirical analysis suggests that the proposed crossmodal attention mechanism in MulT is able to capture correlated crossmodal signals.
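
The directional pairwise crossmodal attention described above uses one modality's sequence to query another's: the target stream provides the queries, the source stream provides the keys and values, and the source is thereby latently adapted to the target's time steps even when the two sequences have different lengths. The following is a minimal sketch of this idea in PyTorch; it is illustrative only, and the class, parameter names, and dimensions are ours rather than the authors' released implementation:

```python
# Minimal sketch of directional pairwise crossmodal attention (source -> target),
# assuming PyTorch; names and dimensions are illustrative, not MulT's released code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossmodalAttention(nn.Module):
    """Latently adapts a source modality (beta) to a target modality (alpha).

    Queries come from the target stream, keys/values from the source stream,
    so the output keeps the target's sequence length while carrying
    information from the source across distinct time steps.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)  # projects the target (alpha) to queries
        self.k_proj = nn.Linear(dim, dim)  # projects the source (beta) to keys
        self.v_proj = nn.Linear(dim, dim)  # projects the source (beta) to values
        self.scale = dim ** -0.5

    def forward(self, x_alpha: torch.Tensor, x_beta: torch.Tensor) -> torch.Tensor:
        # x_alpha: (batch, T_alpha, dim), x_beta: (batch, T_beta, dim);
        # the two sequence lengths need not match (unaligned modalities).
        q = self.q_proj(x_alpha)
        k = self.k_proj(x_beta)
        v = self.v_proj(x_beta)
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)  # (batch, T_alpha, T_beta)
        return attn @ v  # (batch, T_alpha, dim): source content aligned to target time steps


# Example: adapt an audio stream (375 frames) to a language stream (50 tokens).
if __name__ == "__main__":
    layer = CrossmodalAttention(dim=40)
    language = torch.randn(8, 50, 40)   # hypothetical word-level features
    audio = torch.randn(8, 375, 40)     # hypothetical frame-level acoustic features
    adapted = layer(language, audio)
    print(adapted.shape)  # torch.Size([8, 50, 40])
```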

Highlights

  • Human language possesses spoken words and nonverbal behaviors from vision (facial attributes) and acoustic (tone of voice) modalities (Gibson et al., 1994)

  • When modeling unaligned multimodal language sequences, Multimodal Transformer (MulT) relies on crossmodal attention blocks to merge signals across modalities

Summary

Introduction

Human language possesses spoken words and nonverbal behaviors from vision (facial attributes) and acoustic (tone of voice) modalities (Gibson et al., 1994). This rich information helps us understand human behaviors and intents (Manning et al., 2014). However, the heterogeneities across modalities often increase the difficulty of analyzing human language. The receptors for audio and vision streams may vary in sampling frequency, so an optimal mapping between them may not be obtained; for example, a frowning face may relate to a pessimistic word spoken in the past.

[Figure: pre-defined word-level alignment between vision and language streams for the utterance "It's huge sort of spectacle movie".]
