Textbook question answering (TQA) requires answering both diagram and non-diagram questions accurately, given a long multimodal context of abundant diagrams and essays. Although many studies have made significant progress on natural-image question answering (QA), their methods do not transfer to comprehending diagrams or reasoning over such long multimodal contexts. To address these issues, we propose a relation-aware fine-grained reasoning (RAFR) network that performs fine-grained reasoning over the nodes of relation-based diagram graphs. Our method constructs relation graphs from the semantic dependencies and relative positions between nodes in a diagram and applies graph attention networks to learn diagram representations. To extract and reason over multimodal knowledge, we first retrieve the text most relevant to the question and its options at the word-sentence level, and the instructional diagram most relevant to the question diagram at the node-diagram level. We then apply instructional-diagram-guided attention and question-guided attention to reason over the nodes of the question diagram. Experimental results show that the proposed method achieves the best performance on the TQA dataset compared with baselines, and extensive ablation studies comprehensively analyze its components.
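The abstract mentions applying graph attention networks to relation-based diagram graphs. As a rough illustration (not the paper's implementation), the sketch below shows a minimal single-head graph-attention layer in NumPy: diagram nodes carry feature vectors, a relation-derived adjacency matrix masks the attention, and each node aggregates its related neighbors. All names and shapes here are illustrative assumptions.

```python
import numpy as np

def gat_layer(H, A, W, a, alpha=0.2):
    """Minimal single-head graph-attention layer (illustrative sketch only).
    H: (N, F) node features; A: (N, N) relation adjacency (1 = related,
    include self-loops); W: (F, F') projection; a: (2*F',) attention vector."""
    Z = H @ W                                    # project node features
    N = Z.shape[0]
    # attention logits e_ij = LeakyReLU(a^T [z_i || z_j])
    e = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            s = a @ np.concatenate([Z[i], Z[j]])
            e[i, j] = s if s > 0 else alpha * s
    e = np.where(A > 0, e, -1e9)                 # mask unrelated node pairs
    att = np.exp(e - e.max(axis=1, keepdims=True))
    att = att / att.sum(axis=1, keepdims=True)   # softmax over neighbors
    return att @ Z                               # weighted neighbor aggregation

# toy usage: 3 diagram nodes with 5-d features, fully related
rng = np.random.default_rng(0)
H = rng.normal(size=(3, 5))
A = np.ones((3, 3))                              # adjacency with self-loops
W = rng.normal(size=(5, 4))
a = rng.normal(size=(8,))
H_out = gat_layer(H, A, W, a)                    # (3, 4) updated node features
```

In the paper's setting, the adjacency would come from the semantic dependencies and relative positions between diagram nodes rather than a dense matrix.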