Abstract

Recent work has demonstrated the efficacy of Chain-of-Thought (CoT) reasoning that incorporates multimodal information on a range of complex reasoning tasks. CoT, which involves multiple stages of reasoning, has also been applied to Visual Question Answering (VQA) for scientific questions. Existing research on CoT in science-oriented VQA concentrates primarily on the extraction and integration of visual and textual information. However, it overlooks the fact that image-question pairs, categorized by different attributes (such as subject, topic, category, skill, grade, and difficulty), emphasize distinct textual information, visual information, and reasoning capabilities. This work therefore proposes a novel VQA method, termed PGCL, founded on a prompt guidance strategy and self-supervised contrastive learning. PGCL strategically mines and integrates textual and visual information based on attribute information. Specifically, two prompt templates are first crafted. They are then combined with the attribute information and the interference information of image-question pairs to generate a series of positive and negative prompt samples, respectively. The constructed prompts guide the mining of visual and textual representations. These prompt-guided representations are integrated and enhanced via a transformer architecture and self-supervised contrastive learning, and the fused features are finally used to predict answers for VQA. Extensive experiments substantiate the individual contributions of the components within PGCL, as well as the overall performance of PGCL.
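To make the pipeline described above more concrete, the following is a minimal sketch of how prompt-guided contrastive learning of this kind might be assembled. It is not the paper's implementation: the template wording, the encoder choices, the feature dimensions, and the names (`build_prompts`, `PGCLSketch`, `info_nce`) are all assumptions introduced here for illustration; only the overall flow (attribute/interference prompts, prompt-guided representations, transformer fusion, contrastive alignment, answer prediction) follows the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical prompt templates; the paper's exact wording is not given in the abstract.
POS_TEMPLATE = ("A {subject} question on {topic} ({category}), grade {grade}, "
                "{difficulty} difficulty, requiring the skill of {skill}.")
NEG_TEMPLATE = ("A {subject} question on {topic} ({category}), grade {grade}, "
                "{difficulty} difficulty, requiring the skill of {skill}.")


def build_prompts(attrs: dict, interference_attrs: dict):
    """Combine the templates with attribute information (positive) and
    interference information (negative) of an image-question pair."""
    return POS_TEMPLATE.format(**attrs), NEG_TEMPLATE.format(**interference_attrs)


class PGCLSketch(nn.Module):
    """Sketch of prompt-guided feature mining, transformer-based fusion,
    and answer prediction. Feature dimensions are placeholders."""

    def __init__(self, dim=512, text_dim=768, img_dim=2048,
                 num_answers=5, num_heads=8, num_layers=2):
        super().__init__()
        # In practice, pretrained text/vision backbones would produce the inputs.
        self.text_proj = nn.Linear(text_dim, dim)
        self.img_proj = nn.Linear(img_dim, dim)
        self.prompt_proj = nn.Linear(text_dim, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, text_tokens, img_tokens, prompt_emb):
        # Prompt guidance: condition textual and visual tokens on the prompt embedding.
        guide = self.prompt_proj(prompt_emb).unsqueeze(1)          # (B, 1, dim)
        t = self.text_proj(text_tokens) + guide                    # (B, T, dim)
        v = self.img_proj(img_tokens) + guide                      # (B, V, dim)
        fused = self.fusion(torch.cat([t, v], dim=1))              # (B, T+V, dim)
        pooled = fused.mean(dim=1)                                 # fused representation
        return pooled, self.classifier(pooled)                     # answer logits


def info_nce(anchor, positive, negative, temperature=0.07):
    """Self-supervised contrastive loss: pull the fused representation toward the
    positive prompt representation and push it away from the negative one."""
    anchor, positive, negative = (F.normalize(x, dim=-1) for x in (anchor, positive, negative))
    logits = torch.stack([(anchor * positive).sum(-1),
                          (anchor * negative).sum(-1)], dim=1) / temperature
    labels = torch.zeros(anchor.size(0), dtype=torch.long)  # positive is index 0
    return F.cross_entropy(logits, labels)
```

In this sketch the contrastive term would be optimized jointly with the standard answer-classification loss on the logits returned by `PGCLSketch`; how the two objectives are weighted is not specified in the abstract.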
