Abstract

The Textbook Question Answering (TQA) task requires answering questions by reasoning over both the given diagrams and the accompanying text. The task poses two main challenges. First, diagrams differ from natural images: similar shapes or color blocks may express different semantics, and diagrams within the same topic vary widely in appearance. This visual semantic ambiguity and variable visual appearance make diagram understanding particularly difficult. Second, the text belongs to a specific educational domain rich in terminology, which creates a large gap with the general domain, so it is difficult to represent the text semantics effectively with a text encoder pretrained on general-domain corpora. In this paper, we propose a Spatial-Semantic Collaborative Graph Network (SSCGN) for the TQA task, which enhances diagram and text understanding and facilitates multimodal reasoning. Specifically, the Spatial-guided Semantic Enhancing (SSE) module fully exploits the spatial and semantic relationships between visual objects and OCR tokens to collaboratively enhance diagram semantic understanding. Moreover, building on the semantically enhanced region representations from the SSE module, the Fine-grained Spatial-Aware Graph Network (FSA-GN) captures more fine-grained spatial relationships to obtain richer relation-aware region representations for joint reasoning. We further propose multiple self-supervised auxiliary tasks that pretrain the diagram encoder and text encoder to strengthen the initial diagram and text semantic representations. Extensive experiments and ablation studies validate the effectiveness of SSCGN.
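To make the core idea of spatially biased graph reasoning over diagram regions concrete, the following is a minimal sketch in PyTorch. It is not the paper's actual FSA-GN implementation: the class name, the number of discrete spatial relation types, and the additive-bias fusion are all illustrative assumptions. The sketch only shows the general pattern of modulating graph attention between region features with pairwise spatial relation labels.

```python
# Minimal sketch of a spatial-aware graph attention layer (PyTorch).
# All names (SpatialAwareGraphLayer, num_spatial_relations, ...) are
# hypothetical and NOT taken from the paper; this illustrates the
# general technique of biasing region-to-region attention with
# discrete pairwise spatial relations.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialAwareGraphLayer(nn.Module):
    def __init__(self, dim: int, num_spatial_relations: int = 11):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        # One learned scalar bias per discrete spatial relation type
        # (e.g., "inside", "overlap", "left-of", ...).
        self.rel_bias = nn.Embedding(num_spatial_relations, 1)
        self.scale = dim ** -0.5

    def forward(self, regions: torch.Tensor, rel_ids: torch.Tensor) -> torch.Tensor:
        # regions: (N, dim) region features; rel_ids: (N, N) relation ids.
        q, k, v = self.query(regions), self.key(regions), self.value(regions)
        scores = (q @ k.t()) * self.scale                     # content affinity
        scores = scores + self.rel_bias(rel_ids).squeeze(-1)  # spatial bias
        attn = F.softmax(scores, dim=-1)
        return regions + attn @ v                             # residual update


# Toy usage: 5 regions with random pairwise spatial relation labels.
layer = SpatialAwareGraphLayer(dim=64)
feats = torch.randn(5, 64)
rels = torch.randint(0, 11, (5, 5))
out = layer(feats, rels)  # (5, 64) relation-aware region representations
```

In practice, several such layers could be stacked, with the relation labels derived from bounding-box geometry (containment, overlap, relative direction) of the detected visual objects and OCR tokens.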
