As an interdisciplinary field at the intersection of natural language processing and computer vision, Visual Question Answering (VQA) has emerged as a prominent research focus in artificial intelligence. The core of the VQA task is to combine natural language understanding and image analysis to infer answers by extracting meaningful features from textual and visual inputs. However, most current models struggle to fully capture the deep semantic relationships between images and text owing to their limited capacity to model feature interactions, which constrains their performance. To address these challenges, this paper proposes a Trilinear Multigranularity and Multimodal Adaptive Fusion algorithm (TriMMF) designed to improve the efficiency of multimodal feature extraction and fusion in VQA tasks. Specifically, TriMMF consists of three key modules: (1) an Answer Generation Module, which generates candidate answers by extracting fused features and leveraging question features to focus on critical regions within the image; (2) a Fine-grained and Coarse-grained Interaction Module, which achieves multimodal interaction between question and image features at different granularities and incorporates implicit answer information to capture complex multimodal correlations; and (3) an Adaptive Weight Fusion Module, which selectively integrates coarse-grained and fine-grained interaction features based on task requirements, thereby enhancing the model’s robustness and generalization capability. Experimental results demonstrate that TriMMF significantly outperforms existing methods on the VQA v1.0 and VQA v2.0 datasets, achieving state-of-the-art question–answer accuracy. These findings indicate that TriMMF effectively captures the deep semantic associations between images and text. The proposed approach offers new insights into multimodal interaction and fusion research and, when combined with domain adaptation techniques, could be extended to a broader range of cross-domain visual question answering tasks.
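To make the adaptive fusion idea concrete, the following is a minimal PyTorch-style sketch of how coarse-grained and fine-grained interaction features might be combined with input-dependent weights. The class name `AdaptiveWeightFusion`, the feature dimension, and the softmax gating scheme are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of adaptive weight fusion: coarse- and fine-grained
# interaction features are combined via learned, input-dependent gating
# weights. Names, dimensions, and the gating scheme are assumptions.
import torch
import torch.nn as nn


class AdaptiveWeightFusion(nn.Module):
    """Fuse coarse- and fine-grained multimodal features with learned weights."""

    def __init__(self, dim: int = 512):
        super().__init__()
        # Small gating network that scores each granularity per sample.
        self.gate = nn.Linear(2 * dim, 2)

    def forward(self, coarse: torch.Tensor, fine: torch.Tensor) -> torch.Tensor:
        # coarse, fine: (batch, dim) interaction features.
        weights = torch.softmax(self.gate(torch.cat([coarse, fine], dim=-1)), dim=-1)
        # Weighted sum decides how much each granularity contributes.
        return weights[:, 0:1] * coarse + weights[:, 1:2] * fine


if __name__ == "__main__":
    fusion = AdaptiveWeightFusion(dim=512)
    coarse = torch.randn(4, 512)  # e.g., global question-image interaction
    fine = torch.randn(4, 512)    # e.g., region-word level interaction
    print(fusion(coarse, fine).shape)  # torch.Size([4, 512])
```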