Abstract

Human language is often multimodal, comprising a mixture of natural language, facial gestures, and acoustic behaviors. However, two major challenges exist in modeling such multimodal human language time-series data: 1) inherent data non-alignment due to variable sampling rates for the sequences from each modality; and 2) long-range dependencies between elements across modalities. In this paper, we introduce the Multimodal Transformer (MulT) to generically address the above issues in an end-to-end manner without explicitly aligning the data. At the heart of our model is the directional pairwise crossmodal attention, which attends to interactions between multimodal sequences across distinct time steps and latently adapts streams from one modality to another. Comprehensive experiments on both aligned and non-aligned multimodal time-series show that our model outperforms state-of-the-art methods by a large margin. In addition, empirical analysis suggests that the proposed crossmodal attention mechanism in MulT is able to capture correlated crossmodal signals.
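
The directional pairwise crossmodal attention described above uses one modality's sequence to query another's: the target stream provides the queries, the source stream provides the keys and values, and the source is thereby latently adapted to the target's time steps even when the two sequences have different lengths. The following is a minimal sketch of this idea in PyTorch; it is illustrative only, and the class, parameter names, and dimensions are ours rather than the authors' released implementation:

```python
# Minimal sketch of directional pairwise crossmodal attention (source -> target),
# assuming PyTorch; names and dimensions are illustrative, not MulT's released code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossmodalAttention(nn.Module):
    """Latently adapts a source modality (beta) to a target modality (alpha).

    Queries come from the target stream, keys/values from the source stream,
    so the output keeps the target's sequence length while carrying
    information from the source across distinct time steps.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)  # projects the target (alpha) to queries
        self.k_proj = nn.Linear(dim, dim)  # projects the source (beta) to keys
        self.v_proj = nn.Linear(dim, dim)  # projects the source (beta) to values
        self.scale = dim ** -0.5

    def forward(self, x_alpha: torch.Tensor, x_beta: torch.Tensor) -> torch.Tensor:
        # x_alpha: (batch, T_alpha, dim), x_beta: (batch, T_beta, dim);
        # the two sequence lengths need not match (unaligned modalities).
        q = self.q_proj(x_alpha)
        k = self.k_proj(x_beta)
        v = self.v_proj(x_beta)
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)  # (batch, T_alpha, T_beta)
        return attn @ v  # (batch, T_alpha, dim): source content aligned to target time steps


# Example: adapt an audio stream (375 frames) to a language stream (50 tokens).
if __name__ == "__main__":
    layer = CrossmodalAttention(dim=40)
    language = torch.randn(8, 50, 40)   # hypothetical word-level features
    audio = torch.randn(8, 375, 40)     # hypothetical frame-level acoustic features
    adapted = layer(language, audio)
    print(adapted.shape)  # torch.Size([8, 50, 40])
```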

Highlights

  • Human language possesses spoken words and nonverbal behaviors from vision (facial attributes) and acoustic (tone of voice) modalities (Gibson et al., 1994)

  • When modeling unaligned multimodal language sequences, Multimodal Transformer (MulT) relies on crossmodal attention blocks to merge signals across modalities

Summary

Introduction

Human language possesses spoken words and nonverbal behaviors from vision (facial attributes) and acoustic (tone of voice) modalities (Gibson et al., 1994). This rich information helps us understand human behaviors and intents (Manning et al., 2014). However, the heterogeneities across modalities often increase the difficulty of analyzing human language. The receptors for audio and vision streams may vary in sampling frequency, so an optimal mapping between them may not be obtained; for example, a frowning face may relate to a pessimistic word spoken in the past.

[Figure: pre-defined word-level alignment between vision and language streams for the utterance "It's huge sort of spectacle movie".]
