Boosting Generic Visual-Linguistic Representation With Dynamic Contexts

Guoqing Ma,Yalong Bai,Tao Mei,Ting Yao,Wei Zhang,Basem Shihada

doi:10.1109/tmm.2023.3237164

Abstract

Pretraining large models on generous multi-modal corpora has accelerated the development of visual-linguistic (VL) representation and achieved great success on various vision-and-language downstream tasks. Learning these models is usually executed by predicting the randomly masked words of captions or patches in images. Such approaches, nevertheless, seldom explore the supervision of causalities behind the caption descriptions or the procedure of generating events beyond still images. In this work, we endow the pretrained models with high-level cognition by delving into dynamic contexts to model the visual and linguistic causalities uniformly. Specifically, we format the <italic xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">dynamic contexts</i> of an image as the sentences describing the events <italic xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">before</i> , <italic xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">on</i> , and <italic xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">after</i> image. Unlike traditional caption-wise similarity, we propose a novel dynamic contexts-based similarity (DCS) metric, in which the correlation of potential causes and effects besides immediate visual content are considered to measure the relevance among images. DCS can be further simplified by parameterizing event continuity to relax the requirements on dense contextual event annotations. A new pre-task is designed to minimize the feature distances of dynamically contextual relevant images and incorporate the event causality and commonsense knowledge into the VL representation learning. Models based on our dynamic contexts significantly outperform typical VL models on multiple cross-modal downstream tasks, including the conventional visual commonsense reasoning (VCR), visual question answering (VQA), zero-shot image-text retrieval, and extended image / event ordering tasks.

Full Text