Transformers have become essential for capturing intra- and inter-modal dependencies in visual question answering (VQA). Yet challenges remain in overcoming inherent comprehension biases and in improving the relational dependency modeling and reasoning capabilities that VQA tasks demand. This paper presents RMCB, a novel VQA model designed to mitigate these biases by integrating contextual information from both visual and linguistic sources and addressing potential comprehension limitations in each modality. On the language side, RMCB enriches the relational modeling of language tokens with textual context, countering the comprehension biases that arise when token relationships are modeled only as isolated pairs. On the visual side, RMCB systematically incorporates both absolute and relative spatial relations as contextual cues for image tokens, refining dependency modeling and strengthening inferential reasoning to alleviate biases caused by limited contextual understanding. Evaluated on the benchmark datasets VQA-v2 and CLEVR, RMCB achieves state-of-the-art accuracies of 71.78% and 99.27%, respectively. These results underscore RMCB's capability to address comprehension biases while advancing the relational reasoning needed for VQA.
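The abstract does not give RMCB's exact formulation, so the following is only an illustrative sketch of the general idea of using absolute and relative spatial relations as contextual cues for image tokens. It assumes region features with normalized bounding boxes; the class name `SpatialContextBias`, the choice of geometric features, and all hyperparameters are hypothetical, not taken from the paper.

```python
import torch
import torch.nn as nn

class SpatialContextBias(nn.Module):
    """Illustrative module: maps absolute box geometry and pairwise relative
    geometry between image regions to an additive attention bias per head."""

    def __init__(self, num_heads: int, hidden: int = 64):
        super().__init__()
        # 4 absolute features per box: normalized (cx, cy, w, h)
        self.abs_proj = nn.Linear(4, hidden)
        # 4 relative features per box pair: (dx/w, dy/h, log w-ratio, log h-ratio)
        self.rel_proj = nn.Linear(4, hidden)
        self.to_bias = nn.Linear(hidden, num_heads)

    def forward(self, boxes: torch.Tensor) -> torch.Tensor:
        # boxes: (B, N, 4) with normalized (cx, cy, w, h) for each region
        cx, cy, w, h = boxes.unbind(-1)                                  # each (B, N)
        dx = (cx.unsqueeze(2) - cx.unsqueeze(1)) / (w.unsqueeze(2) + 1e-6)
        dy = (cy.unsqueeze(2) - cy.unsqueeze(1)) / (h.unsqueeze(2) + 1e-6)
        dw = torch.log(w.unsqueeze(2) / (w.unsqueeze(1) + 1e-6) + 1e-6)
        dh = torch.log(h.unsqueeze(2) / (h.unsqueeze(1) + 1e-6) + 1e-6)
        rel = torch.stack([dx, dy, dw, dh], dim=-1)                      # (B, N, N, 4)

        # Combine absolute context of the query region with pairwise relative context
        ctx = self.abs_proj(boxes).unsqueeze(2) + self.rel_proj(rel)     # (B, N, N, hidden)
        bias = self.to_bias(torch.relu(ctx))                             # (B, N, N, heads)
        return bias.permute(0, 3, 1, 2)                                  # (B, heads, N, N)
```

In a relation-aware attention layer, a bias of this kind would typically be added to the image-token attention logits before the softmax, so that spatial context conditions which region pairs the model attends to.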