Abstract

Transformers and their variants have become the preferred option for multimodal vision-language paradigms. However, they struggle with tasks that demand high-dependency modeling and reasoning, such as visual question answering (VQA) and visual grounding (VG). To address this, we propose a general scheme called MPCCT, which: (1) incorporates designed textual global-context information to facilitate precise computation of dependency relationships between language tokens in the language encoder; (2) dynamically modulates and filters image features using the optimized textual global-context information, combined with designed spatial-context information, to further strengthen the dependency modeling of image tokens and the model's reasoning ability; and (3) reasonably aligns the language sequence carrying textual global-context information with the image sequence modulated by spatial position information. To validate MPCCT, we conduct extensive experiments on five benchmark datasets for VQA and VG, achieving new state-of-the-art (SOTA) performance on multiple benchmarks, notably 73.71% on VQA-v2 and 99.15% on CLEVR. The code is available at https://github.com/RainyMoo/myvqa.
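The sketch below is not the authors' implementation; it is a minimal, hedged illustration of the kind of context-guided modulation the abstract describes, in which a global textual context vector gates image-token features that have been augmented with spatial position information. All module and parameter names (ContextGuidedModulation, spatial_pos, gate) are illustrative assumptions.

```python
import torch
import torch.nn as nn


class ContextGuidedModulation(nn.Module):
    """Toy sketch: gate image tokens with a textual global-context vector."""

    def __init__(self, dim: int, num_regions: int = 49):
        super().__init__()
        # Learned spatial position embeddings stand in for "spatial context information".
        self.spatial_pos = nn.Embedding(num_regions, dim)
        # Gate derived from the textual global context modulates/filters image features.
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, text_feats: torch.Tensor, img_feats: torch.Tensor) -> torch.Tensor:
        # text_feats: (B, L, D) language-encoder outputs; img_feats: (B, R, D) image tokens.
        global_ctx = text_feats.mean(dim=1)                   # (B, D) textual global context
        pos_ids = torch.arange(img_feats.size(1), device=img_feats.device)
        img_feats = img_feats + self.spatial_pos(pos_ids)     # inject spatial position info
        g = self.gate(global_ctx).unsqueeze(1)                # (B, 1, D) modulation gate
        return g * img_feats                                  # modulated/filtered image tokens


# Usage example with dummy tensors
mod = ContextGuidedModulation(dim=512)
text = torch.randn(2, 14, 512)   # 14 language tokens
image = torch.randn(2, 49, 512)  # 49 image region tokens
out = mod(text, image)           # (2, 49, 512)
```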
