SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning

Zhecan Wang, Liunian Harold Li, Yiqing Liang, Suji Park, Haoxuan You, Alireza Zareian, Kai-Wei Chang, Shih-Fu Chang

doi:10.1609/aaai.v36i5.20536

Abstract

Answering complex questions about images is an ambitious goal for machine intelligence, which requires a joint understanding of images, text, and commonsense knowledge, as well as a strong reasoning ability. Recently, multimodal Transformers have made great progress on the task of Visual Commonsense Reasoning (VCR) by jointly understanding visual objects and text tokens through layers of cross-modality attention. However, these approaches do not utilize the rich structure of the scene and the interactions between objects, which are essential in answering complex commonsense questions. We propose a Scene Graph Enhanced Image-Text Learning (SGEITL) framework to incorporate visual scene graphs in commonsense reasoning. To exploit the scene graph structure at the model level, we propose a multihop graph transformer for regularizing attention interaction among hops. For pre-training, we propose a scene-graph-aware method to leverage structural knowledge extracted from visual scene graphs. Moreover, we introduce a method to train and generate domain-relevant visual scene graphs using textual annotations in a weakly supervised manner. Extensive experiments on VCR and other tasks show a significant performance boost compared with state-of-the-art methods and demonstrate the efficacy of each proposed component.
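
The abstract does not spell out how attention is regularized among hops. As a rough illustration only, and not the authors' architecture, the sketch below shows one common way to make attention hop-aware: compute hop distances between scene-graph nodes and allow a node to attend only to nodes within a chosen number of hops. The function names `hop_distances` and `hop_attention_mask`, the undirected treatment of edges, and the toy graph are all assumptions made for this example.

```python
from collections import deque


def hop_distances(num_nodes, edges):
    """All-pairs hop counts via BFS on an undirected graph.

    Unreachable pairs are marked with None. Treating scene-graph
    relations as undirected edges is an assumption of this sketch.
    """
    adj = [[] for _ in range(num_nodes)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    dist = [[None] * num_nodes for _ in range(num_nodes)]
    for src in range(num_nodes):
        dist[src][src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[src][v] is None:
                    dist[src][v] = dist[src][u] + 1
                    queue.append(v)
    return dist


def hop_attention_mask(dist, max_hops):
    """Boolean mask: True where node i may attend to node j.

    A pair is allowed when j is reachable from i in at most max_hops hops.
    """
    return [[d is not None and d <= max_hops for d in row] for row in dist]


# Hypothetical toy scene graph: person(0) - cup(1) - table(2), lamp(3) isolated.
dist = hop_distances(4, [(0, 1), (1, 2)])
one_hop = hop_attention_mask(dist, max_hops=1)
two_hop = hop_attention_mask(dist, max_hops=2)
print(one_hop[0])  # which nodes "person" may attend to within 1 hop
print(two_hop[0])  # which nodes "person" may attend to within 2 hops
```

In a Transformer, a mask of this kind would typically be applied to the attention logits before the softmax, with a different hop limit per layer or per head; whether SGEITL does this is not stated in the abstract.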

Journal: Proceedings of the AAAI Conference on Artificial Intelligence
Publication Date: Jun 28, 2022
Citations: 14

Similar Papers

NeuSyRE: Neuro-symbolic visual understanding and reasoning framework based on scene graph enrichment
M Jaleed Khan ... Edward Curry
Semantic Web | VOL. 15
04 Oct 2024

Expressive Scene Graph Generation Using Commonsense Knowledge Infusion for Visual Understanding and Reasoning
Muhammad Jaleed Khan ... John G Breslin
01 Jan 2021

CORECODE: A Common Sense Annotated Dialogue Dataset with Benchmark Tasks for Chinese Large Language Models
Dan Shi ... Jiantao Huang
Proceedings of the AAAI Conference on Artificial Intelligence | VOL. 38
24 Mar 2024

Commonsense Reasoning for Natural Language Processing
Maarten Sap ... Antoine Bosselut
01 Jan 2020
