Abstract

Modeling expressive cross-modal interactions seems crucial in multimodal tasks, such as visual question answering. However, sometimes high-performing black-box algorithms turn out to be mostly exploiting unimodal signals in the data. We propose a new diagnostic tool, empirical multimodally-additive function projection (EMAP), for isolating whether or not cross-modal interactions improve performance for a given model on a given task. This function projection modifies model predictions so that cross-modal interactions are eliminated, isolating the additive, unimodal structure. For seven image+text classification tasks (on each of which we set new state-of-the-art benchmarks), we find that, in many cases, removing cross-modal interactions results in little to no performance degradation. Surprisingly, this holds even when expressive models, with capacity to consider interactions, otherwise outperform less expressive models; thus, performance improvements, even when present, often cannot be attributed to consideration of cross-modal feature interactions. We hence recommend that researchers in multimodal machine learning report the performance not only of unimodal baselines, but also the EMAP of their best-performing model.

Highlights

  • Given the presumed importance of reasoning across modalities in multimodal machine learning tasks, we should evaluate a model’s ability to leverage cross-modal interactions

  • We propose Empirical Multimodally-Additive function Projection (EMAP) as an additional diagnostic for analyzing multimodal classification models

  • The performance of our baseline additive linear model is strong, but we are usually able to find an interactive model that outperforms this linear baseline, e.g., in the case of TST2, a polynomial kernel SVM outperforms the linear model by 4 accuracy points
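
As one illustration of this kind of comparison, the scikit-learn sketch below contrasts an additive linear baseline with a polynomial-kernel SVM over concatenated unimodal features. The feature arrays and labels are placeholders rather than any of the paper's datasets, and the numbers it prints are not results from the paper.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC, LinearSVC

# Placeholder unimodal features; in practice these would be pre-extracted
# text and image representations for the task at hand.
rng = np.random.default_rng(0)
text_feats = rng.normal(size=(400, 64))
image_feats = rng.normal(size=(400, 64))
labels = rng.integers(0, 2, size=400)

# Concatenate the two modalities into a single feature vector per example.
X = np.hstack([text_feats, image_feats])

# Additive baseline: a linear model over concatenated features cannot
# represent multiplicative interactions between text and image features.
linear_acc = cross_val_score(LinearSVC(), X, labels, cv=5).mean()

# Interactive model: a polynomial kernel lets the SVM use products of
# text and image features, i.e., cross-modal interactions.
poly_acc = cross_val_score(SVC(kernel="poly", degree=2), X, labels, cv=5).mean()

print(f"additive linear SVM: {linear_acc:.3f}  polynomial-kernel SVM: {poly_acc:.3f}")
```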

Introduction

Given the presumed importance of reasoning across modalities in multimodal machine learning tasks, we should evaluate a model’s ability to leverage cross-modal interactions. Such evaluation is not straightforward; for example, an early Visual Question-Answering (VQA) challenge was later “broken” by a high-performing method that ignored the image entirely (Jabri et al., 2016). One response is to create multimodal-reasoning datasets that are cleverly balanced to resist language-only or visual-only models; examples include VQA 2.0 (Goyal et al., 2017) and NLVR2. We consider models f that assign scores to textual-visual pairs (t, v), where t is a piece of text (e.g., a sentence) and v is an image.
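
To make this concrete, the sketch below shows one way to compute an empirical additive projection of a trained model's scores, in the spirit of EMAP. It assumes the model's class scores have been precomputed for every text-image pairing in the evaluation set; the function name and array layout are our own illustration, not the authors' released code.

```python
import numpy as np

def emap_project(score_matrix):
    """Empirical additive projection of precomputed model scores.

    score_matrix has shape (N, N, C): entry [i, j, c] is the model's score
    for class c when text i is paired with image j; the real evaluation
    pairs lie on the diagonal (i == j). Each entry is replaced by
    (text mean + image mean - grand mean), which removes any cross-modal
    interaction and keeps only additive, unimodal structure.
    """
    text_effect = score_matrix.mean(axis=1, keepdims=True)   # average over all images
    image_effect = score_matrix.mean(axis=0, keepdims=True)  # average over all texts
    grand_mean = score_matrix.mean(axis=(0, 1), keepdims=True)
    projected = text_effect + image_effect - grand_mean      # shape (N, N, C)
    n = score_matrix.shape[0]
    return projected[np.arange(n), np.arange(n)]             # scores for the real pairs, shape (N, C)

# Hypothetical usage, where model_scores(t, v) returns a length-C score vector:
# scores = np.stack([[model_scores(t, v) for v in images] for t in texts])
# emap_accuracy = (emap_project(scores).argmax(-1) == gold_labels).mean()
```

Comparing task accuracy under the projected scores with accuracy under the original (diagonal) scores then indicates how much of the model's performance actually depends on cross-modal interactions.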
