Leveraging Generated Image Captions for Visual Commonsense Reasoning

Abstract

Visual Commonsense Reasoning (VCR) is a cognition-level task that requires drawing accurate conclusions from a thorough understanding of an image. Unlike Visual Question Answering (VQA), where the model merely chooses a correct answer, VCR requires models not only to pick an answer but also to identify an appropriate rationale. Traditionally, VCR models have relied predominantly on visual data for their reasoning. However, achieving a comprehensive understanding of an image remains challenging, as it often requires reasoning beyond visual cues alone. We propose to use generated image captions to enhance a VCR model's reasoning capabilities, and we introduce fusion strategies that integrate the image caption into the VCR model, enabling a better understanding of the image. To evaluate the effectiveness of the proposed approach, we conduct experiments on the benchmark VCR dataset. The results demonstrate that the late fusion strategy enhances the performance of baseline VCR models, yielding improved accuracy and reasoning capability.
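As a rough illustration of the late fusion idea mentioned above, the sketch below combines per-answer scores from a vision-only VCR model with scores from a model conditioned on the generated caption. The function name `late_fusion`, the weighted-sum combination, and the mixing weight `alpha` are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def late_fusion(vcr_logits, caption_logits, alpha=0.5):
    # Hypothetical late fusion: a weighted sum of per-answer logits from
    # the vision-based VCR branch and the caption-based branch.
    # alpha controls how much weight the visual branch receives.
    vcr_logits = np.asarray(vcr_logits, dtype=float)
    caption_logits = np.asarray(caption_logits, dtype=float)
    return alpha * vcr_logits + (1.0 - alpha) * caption_logits

# Toy scores over the four answer choices of a VCR question.
vcr_logits = [2.0, 0.5, 1.0, 0.3]       # vision-only branch
caption_logits = [1.5, 2.5, 0.2, 0.1]   # caption-conditioned branch

fused = late_fusion(vcr_logits, caption_logits, alpha=0.6)
prediction = int(np.argmax(fused))  # index of the selected answer
```

Because the two branches are combined only at the score level, either branch can be trained or swapped independently, which is the usual appeal of late fusion over merging features earlier in the network.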
