Visual Question Answering (VQA) is a multimodal task that requires a collaborative understanding of fine-grained visual concepts and language semantics. The key to VQA research is constructing a framework that models the inter- and intra-modal information interactions between intricate modalities. Owing to their superior global view and their ability to capture relationships within multimodal information, the Transformer and its variants have become the prime choice for VQA tasks. Although this layer-by-layer architecture enables answer reasoning by progressively refining multimodal feature information, information may still be lost as it is transferred from lower to higher layers. To address this issue, we propose a Layer-Residual Mechanism (LRM), a plug-and-play, generic approach whose additional computation and memory overhead is almost negligible. By adding a residual straight-through line between adjacent layers that cascades the attention blocks in depth, LRM mitigates information vanishing during transfer across layers, stabilizes training, and helps overcome the performance decline of VQA models at deeper layers. To verify the effectiveness and generality of the proposed LRM, we apply it to the Encoder-Decoder structure, the Pure-Stacking structure, and a specifically designed Co-Stacking structure, called Layer-residual Co-Attention Networks (LRCN), that can simultaneously understand textual and visual features. Extensive ablation studies and comparative experiments on the benchmark VQA v2 and CLEVR datasets show that LRCN significantly outperforms the original architectures, demonstrating the effectiveness and compatibility of the proposed approach.
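To make the idea of a residual straight-through line between adjacent layers concrete, the following is a minimal sketch in Python/PyTorch. It assumes one plausible placement of the layer-residual path, namely adding the previous layer's attention output to the current layer's attention output before the feed-forward sub-layer; the class name `LayerResidualEncoder`, the dimensions, and this exact placement are illustrative assumptions, not the paper's definitive formulation.

```python
import torch
import torch.nn as nn


class LayerResidualEncoder(nn.Module):
    """Hypothetical stack of self-attention layers with a layer-residual
    straight-through path: the previous layer's attention output is added
    to the current layer's attention output (an assumed placement)."""

    def __init__(self, dim=512, heads=8, depth=6):
        super().__init__()
        self.attn = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(depth)])
        self.ffn = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
             for _ in range(depth)])
        self.norm1 = nn.ModuleList([nn.LayerNorm(dim) for _ in range(depth)])
        self.norm2 = nn.ModuleList([nn.LayerNorm(dim) for _ in range(depth)])

    def forward(self, x):
        prev_attn = None  # carries the previous layer's attention output
        for attn, ffn, n1, n2 in zip(self.attn, self.ffn, self.norm1, self.norm2):
            a, _ = attn(x, x, x)
            if prev_attn is not None:
                a = a + prev_attn   # layer-residual straight-through line
            prev_attn = a
            x = n1(x + a)           # standard residual + layer norm
            x = n2(x + ffn(x))      # feed-forward sub-layer
        return x


# Usage: a batch of 2 sequences of 50 token features of width 512.
enc = LayerResidualEncoder()
out = enc(torch.randn(2, 50, 512))  # shape (2, 50, 512)
```

Because the extra path is a single element-wise addition per layer, it adds no new parameters, which is consistent with the abstract's claim of near-negligible computation and memory overhead.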