Investigation of Explainability Techniques for Multimodal Transformers
Abstract
Multimodal transformers such as CLIP and ViLBERT have become increasingly popular for visiolinguistic tasks as they have an efficient and generalizable understanding of visual features and labels. Notable examples of visiolinguistic models include OpenAI’s CLIP by Radford et al. and ViLBERT by Lu et al. One of the gaps in current multimodal transformers is that there are no unified explainability frameworks to compare attention interactions meaningfully between models. To address the comparability concern, we investigate two different explainability frameworks. Specifically, Label Attribution and Optimal Transport of Vision-Language semantic spaces, applied to the VisualBERT multimodal transformer model, provide an interpretability process towards understanding attention interactions in multimodal transformers. We provide a case study of VisualBERT trained on the Visual Genome and Question Answer 2 datasets.
Keywords: Multimodal transformers, Label attribution, Optimal transport
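The abstract does not say how the optimal transport between vision and language semantic spaces is computed. As a rough illustration only, the sketch below assumes a cosine-distance cost between toy image-region and word embeddings, uniform marginals, and the entropy-regularised Sinkhorn iteration; the function names, the embeddings, and the `reg` value are all hypothetical and are not taken from the paper.

```python
import math


def cosine_cost(xs, ys):
    """Cost matrix C[i][j] = 1 - cosine similarity between xs[i] and ys[j]."""
    def norm(v):
        return math.sqrt(sum(a * a for a in v))

    return [
        [1.0 - sum(a * b for a, b in zip(x, y)) / (norm(x) * norm(y)) for y in ys]
        for x in xs
    ]


def sinkhorn(cost, reg=0.1, n_iters=200):
    """Entropy-regularised optimal transport plan with uniform marginals."""
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # mass on each image region (assumed uniform)
    b = [1.0 / m] * m  # mass on each word (assumed uniform)
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Hypothetical 2-D embeddings: three image regions and two words.
regions = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
words = [[0.9, 0.1], [0.1, 0.9]]
plan = sinkhorn(cosine_cost(regions, words))

for row in plan:
    print([round(p, 3) for p in row])
```

Each entry `plan[i][j]` is the share of mass moved from region `i` to word `j`, so a large entry can be read as a strong region-word alignment under these assumptions.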
- Research Article
- 10.1007/s00521-025-11721-5
- Feb 1, 2026
- Neural Computing and Applications
Traditional visual question answering (VQA) tasks focus on surface image-text matching, while visual text question answering (VTQA) tasks require deeper cross-modal reasoning. Current Transformer-based models are insufficient in screening effective features. To address these issues, this paper proposes a new cross-media reasoning network (VTFCGNet) that integrates Fourier frequency domain and spatial domain self-attention and graph attention mechanisms. The network can adaptively weight the feature interactions between different modalities, achieve deep fusion of image, text, and question modalities, and overcome the limitations of existing models in VTQA tasks. VTFCGNet first extracts key entities based on the entity extraction network (VTFC-Net) in both the Fourier frequency domain and the spatial domain, thereby reducing the interference of redundant features compared to the traditional self-attention mechanism. Secondly, a cross-media reasoning network (CRG-Net) is employed for multi-step cross-media reasoning, significantly enhancing its ability to capture fine-grained features and model cross-modal relationships compared to traditional VQA models. Finally, comprehensive experiments on VTQA and VQA v2 datasets—using both grid-level and region-level visual features of region proposals—validate the outstanding performance of VTFCGNet. The findings demonstrate that VTFCGNet achieved top accuracies of 71.93% and 75.83% on the VQA v2 test-dev and VTQA test (English Version) datasets, respectively.
- Conference Article
27
- 10.18653/v1/p19-1351
- Jan 1, 2019
Paragraph-style image captions describe diverse aspects of an image as opposed to the more common single-sentence captions that only provide an abstract description of the image. These paragraph captions can hence contain substantial information of the image for tasks such as visual question answering. Moreover, this textual information is complementary with visual information present in the image because it can discuss both more abstract concepts and more explicit, intermediate symbolic information about objects, events, and scenes that can directly be matched with the textual question and copied into the textual answer (i.e., via easier modality match). Hence, we propose a combined Visual and Textual Question Answering (VTQA) model which takes as input a paragraph caption as well as the corresponding image, and answers the given question based on both inputs. In our model, the inputs are fused to extract related information by cross-attention (early fusion), then fused again in the form of consensus (late fusion), and finally expected answers are given an extra score to enhance the chance of selection (later fusion). Empirical results show that paragraph captions, even when automatically generated (via an RL-based encoder-decoder model), help correctly answer more visual questions. Overall, our joint model, when trained on the Visual Genome dataset, significantly improves the VQA performance over a strong baseline model.
- Conference Article
151
- 10.1109/cvpr52688.2022.00493
- Jun 1, 2022
The referring video object segmentation task (RVOS) involves segmentation of a text-referred object instance in the frames of a given video. Due to the complex nature of this multimodal task, which combines text reasoning, video understanding, instance segmentation and tracking, existing approaches typically rely on sophisticated pipelines in order to tackle it. In this paper, we propose a simple Transformer-based approach to RVOS. Our framework, termed Multimodal Tracking Transformer (MTTR), models the RVOS task as a sequence prediction problem. Following recent advancements in computer vision and natural language processing, MTTR is based on the realization that video and text can be processed together effectively and elegantly by a single multimodal Transformer model. MTTR is end-to-end trainable, free of text-related inductive bias components and requires no additional mask-refinement post-processing steps. As such, it simplifies the RVOS pipeline considerably compared to existing methods. Evaluation on standard benchmarks reveals that MTTR significantly outperforms previous art across multiple metrics. In particular, MTTR shows impressive +5.7 and +5.0 mAP gains on the A2D-Sentences and JHMDB-Sentences datasets respectively, while processing 76 frames per second. In addition, we report strong results on the public validation set of Refer-YouTube-VOS, a more challenging RVOS dataset that has yet to receive the attention of researchers. The code to reproduce our experiments is available at https://github.com/mttr2021/MTTR.
- Research Article
- 10.37675/jat.2025.00759
- Dec 30, 2025
- Academic Society for Appropriate Technology
Accelerating climate change and the intensifying global food security crisis have increased the importance of reliable crop classification across diverse environmental conditions. Existing crop classification models have primarily focused on improving accuracy by learning spectral and temporal patterns from satellite imagery; however, their black-box nature makes it difficult to understand the rationale behind each prediction, limiting their applicability in real-world agricultural decision-making. To address this issue, this study introduces a multimodal Transformer model that incorporates a BERT-based bidirectional attention mechanism, aiming to retain classification performance while enhancing interpretability. The proposed BERT Hybrid model employs a PVT backbone to extract spatial features from Sentinel-2 satellite imagery and integrates them with meteorological time-series embeddings; bidirectional self-attention is then used to jointly model cross-temporal and cross-modal interactions. We further conduct comparative experiments under the same conditions as the MMST-ViT (Multi-Modal Spatial-Temporal Vision Transformer) baseline, evaluating not only overall accuracy but also temporal attention patterns across crop growth stages and the relative importance of different weather variables. Experimental results show that bidirectional attention alleviates excessive focus on specific timestamps or single variables, producing more consistent and interpretable attention distributions. This study highlights the performance–interpretability trade-off in multimodal agricultural AI models and provides a foundation for building trustworthy deep-learning systems for crop monitoring.
In addition, because the proposed approach relies solely on globally accessible Sentinel-2 satellite imagery and publicly available meteorological data, it demonstrates the potential for constructing large-scale crop monitoring systems at low cost, aligning with the principles of appropriate technology.
- Conference Article
209
- 10.1109/cvpr.2018.00640
- Jun 1, 2018
Visual question answering (VQA) and visual question generation (VQG) are two trending topics in computer vision, but they are usually explored separately despite their intrinsic complementary relationship. In this paper, we propose an end-to-end unified model, the Invertible Question Answering Network (iQAN), to introduce question generation as a dual task of question answering to improve the VQA performance. With our proposed invertible bilinear fusion module and parameter sharing scheme, our iQAN can accomplish VQA and its dual task VQG simultaneously. By being jointly trained on the two tasks with our proposed dual regularizers (termed Dual Training), our model has a better understanding of the interactions among images, questions and answers. After training, iQAN can take either question or answer as input, and output the counterpart. Evaluated on the CLEVR and VQA2 datasets, our iQAN improves the top-1 accuracy of the prior art MUTAN VQA method by 1.33% and 0.88% (absolute increase) respectively. We also show that our proposed dual training framework can consistently improve model performances of many popular VQA architectures.
- Conference Article
2
- 10.1109/lifetech48969.2020.1570619128
- Mar 1, 2020
We propose a model for free-form visual question answering (VQA) from human brain activity. The task of VQA is to produce an answer given an image and a question about the image. Given brain activity data measured by functional magnetic resonance imaging (fMRI) and a natural language question about the viewed image, the proposed method can provide an accurate natural language answer with the VQA algorithm. Visual questions selectively target various areas of an image such as objects and backgrounds. As a result, a more detailed understanding of the image and more complex reasoning are typically needed than in general image captioning models. In this paper, we propose a method of answering a given question about a viewed image from fMRI data based on the VQA algorithm. We estimate the relation between fMRI data and visual features extracted from viewed images. Based on this relationship, we convert fMRI data into visual features. Finally, the proposed method can answer a visual question from fMRI data measured while subjects are viewing images. Experimental results show that the proposed method enables accurate answering of questions about viewed images.
- Research Article
19
- 10.1109/tpami.2024.3398012
- Dec 1, 2024
- IEEE transactions on pattern analysis and machine intelligence
Recently, a novel multimodal reasoning task named Explanatory Visual Question Answering (EVQA) has been introduced, which combines answering visual questions with multimodal explanation generation to expound upon the underlying reasoning processes. In contrast to conventional Visual Question Answering (VQA) that merely concentrates on providing answers, EVQA aims to improve the explainability and verifiability of reasoning by providing user-friendly explanations. Despite the improved explainability of inferred results, the existing EVQA models still adopt black-box neural networks to infer results, lacking the explainability of the reasoning process. Moreover, existing EVQA models commonly predict answers and explanations in isolation, overlooking the inherent causal correlation between them. To handle these challenges, we propose a Program-guided Variational Causal Inference Network (Pro-VCIN) that integrates neural-symbolic reasoning with variational causal inference and constructs causal correlations between the predicted answers and explanations. First, we utilize pretrained models to extract visual features and convert questions into the corresponding programs. Second, we propose a multimodal program Transformer to translate programs and the related visual features into coherent and rational explanations of the reasoning processes. Finally, we propose a variational causal inference to construct the target structural causal model and predict answers based on the causal correlation to explanations. Comprehensive experiments conducted on EVQA benchmark datasets reveal the superiority of Pro-VCIN in terms of both performance and explainability over state-of-the-art EVQA methods.
- Book Chapter
1
- 10.1007/978-3-030-38445-6_8
- Jan 1, 2020
This paper delineates the automation of question generation as an extension to existing Visual Question Answering (VQA) systems. Through our research, we have been able to build a system that can generate question-answer pairs on images. It consists of two separate modules: a Visual Question Generation (VQG) module, which generates questions based on the image, and a Visual Question Answering (VQA) module, which produces a befitting answer to the question that the VQG module generates. Through our approach, we not only generate questions but also evaluate the generated questions by using a question answering system. Moreover, with our methodology, we can generate question-answer pairs as well as improve the performance of VQA models. It eliminates the need for human intervention in dataset annotation and also finds applications in the educational sector, where human input for textual questions has been essential until now. Using our system, we aim to provide an interactive interface which helps the learning process among young children.
- Conference Article
- 10.1145/3675888.3676107
- Aug 8, 2024
The advent of Visual Question Answering (VQA) technology has brought significant advancements in the medical field, offering transformative potential in clinical diagnostics and patient care. This research explores the application of VQA within the medical domain, highlighting its critical role in interpreting complex visual data, such as radiological images, pathology slides, and other diagnostic visuals. Traditional diagnostic processes often rely heavily on human expertise, which can be time-consuming and prone to variability. VQA systems, powered by sophisticated machine learning models, provide consistent and accurate interpretations, thus enhancing diagnostic accuracy and efficiency. Visual Question Answering (VQA) in the medical field necessitates extracting information from both textual and visual inputs to provide accurate answers, a critical requirement for supporting medical decision-making. This research introduces a novel approach to address VQA challenges in the medical domain using Bi-Directional Layout with Positional Encoding (BLIP) models. Our methodology seamlessly integrates text and image processing within a unified framework, enabling precise interactions between textual queries and medical imaging data. We commence with textual inputs, encoded by BLIP processors, and medical images, encoded by BLIP image processors. A custom VQA dataset, specifically designed for the medical field, includes textual questions and their corresponding medical image features. We employ a BLIP-based Question Answering architecture, fine-tuned on our medical VQA dataset, and optimized using the AdamW optimizer with a learning rate of 0.00005, ensuring efficient convergence. Additionally, we introduce attention mechanisms using Coarse and Fine Attention blocks for enhanced feature fusion and accurate answer prediction. Our results are highly encouraging, demonstrating competitive metrics in extensive VQA task experiments on both training and validation datasets. 
Qualitative analysis of sample predictions indicates the model’s capability to provide accurate answers for diverse visual and textual medical inputs. This work holds significant promise for improving automated medical image analysis and supporting clinical decision-making.
- Research Article
467
- 10.1109/tcsvt.2019.2947482
- Oct 25, 2019
- IEEE Transactions on Circuits and Systems for Video Technology
Image captioning aims to automatically generate a natural language description of a given image, and most state-of-the-art models have adopted an encoder-decoder framework. The framework consists of a convolutional neural network (CNN)-based image encoder that extracts region-based visual features from the input image, and a recurrent neural network (RNN)-based caption decoder that generates the output caption words based on the visual features with the attention mechanism. Despite the success of existing studies, current methods only model the co-attention that characterizes the inter-modal interactions while neglecting the self-attention that characterizes the intra-modal interactions. Inspired by the success of the Transformer model in machine translation, here we extend it to a Multimodal Transformer (MT) model for image captioning. Compared to existing image captioning approaches, the MT model simultaneously captures intra- and inter-modal interactions in a unified attention block. Due to the in-depth modular composition of such attention blocks, the MT model can perform complex multimodal reasoning and output accurate captions. Moreover, to further improve the image captioning performance, multi-view visual features are seamlessly introduced into the MT model. We quantitatively and qualitatively evaluate our approach using the benchmark MSCOCO image captioning dataset and conduct extensive ablation studies to investigate the reasons behind its effectiveness. The experimental results show that our method significantly outperforms the previous state-of-the-art methods. With an ensemble of seven models, our solution ranks first on the real-time leaderboard of the MSCOCO image captioning challenge at the time of the writing of this paper.
- Conference Article
298
- 10.1109/iccv48922.2021.00398
- Oct 1, 2021
Survival outcome prediction is a challenging weakly-supervised and ordinal regression task in computational pathology that involves modeling complex interactions within the tumor microenvironment in gigapixel whole slide images (WSIs). Despite recent progress in formulating WSIs as bags for multiple instance learning (MIL), representation learning of entire WSIs remains an open and challenging problem, especially in overcoming: 1) the computational complexity of feature aggregation in large bags, and 2) the data heterogeneity gap in incorporating biological priors such as genomic measurements. In this work, we present a Multimodal Co-Attention Transformer (MCAT) framework that learns an interpretable, dense co-attention mapping between WSIs and genomic features formulated in an embedding space. Inspired by approaches in Visual Question Answering (VQA) that can attribute how word embeddings attend to salient objects in an image when answering a question, MCAT learns how histology patches attend to genes when predicting patient survival. In addition to visualizing multimodal interactions, our co-attention transformation also reduces the space complexity of WSI bags, which enables the adaptation of Transformer layers as a general encoder backbone in MIL. We apply our proposed method on five different cancer datasets (4,730 WSIs, 67 million patches). Our experimental results demonstrate that the proposed method consistently achieves superior performance compared to the state-of-the-art methods.
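To make the bag-size reduction described above concrete, here is a minimal sketch of scaled dot-product co-attention in which a small set of genomic embeddings acts as queries over a larger bag of patch embeddings. It is a simplification under stated assumptions, not the MCAT implementation: there are no learned query, key, or value projections, and the function names and toy 2-D embeddings are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def co_attention(queries, patches):
    """Each query attends over all patches; returns pooled features and weights.

    Assumption: queries and patches already live in the same d-dimensional
    space, so no projection matrices are applied.
    """
    dim = len(queries[0])
    pooled, weights = [], []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, p)) / math.sqrt(dim) for p in patches
        ]
        w = softmax(scores)
        weights.append(w)
        # Weighted average of patch embeddings for this query.
        pooled.append(
            [sum(w[i] * patches[i][j] for i in range(len(patches))) for j in range(dim)]
        )
    return pooled, weights


# Hypothetical bag of N = 5 patch embeddings and M = 2 genomic embeddings.
patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.5, 0.5]]
genes = [[4.0, 0.0], [0.0, 4.0]]
pooled, attn = co_attention(genes, patches)

print(len(patches), "patches ->", len(pooled), "gene-guided embeddings")
```

The weight matrix `attn` has one row per genomic embedding and one column per patch, which is what would be visualised as a co-attention map, while `pooled` has only as many rows as there are genomic embeddings regardless of the bag size.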
- Research Article
15
- 10.1145/3534619
- Jul 4, 2022
- Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies
Visual Question Answering (VQA) is a relatively new task where a user can ask a natural question about an image and obtain an answer. VQA is useful for many applications and is widely popular for users with visual impairments. Our goal is to design a VQA application that works efficiently on mobile devices without requiring cloud support. Such a system will allow users to ask visual questions privately, without having to send their questions to the cloud, while also reducing cloud communication costs. However, existing VQA applications use deep learning models that significantly improve accuracy but are computationally heavy. Unfortunately, existing techniques that optimize deep learning for mobile devices cannot be applied to VQA because the VQA task is multi-modal: it requires processing both vision and text data. Existing mobile optimizations that work for vision-only or text-only neural networks cannot be applied here because of the dependencies between the two modes. Instead, we design MobiVQA, a set of optimizations that leverage the multi-modal nature of VQA. We show, using extensive evaluation on two VQA testbeds and two mobile platforms, that MobiVQA significantly improves latency and energy with minimal accuracy loss compared to state-of-the-art VQA models. For instance, MobiVQA can answer a visual question in 163 milliseconds on the phone, compared to the over 20-second latency incurred by the most accurate state-of-the-art model, while incurring less than a 1-point reduction in accuracy.
- Conference Article
90
- 10.1109/iccv.2019.00592
- Oct 1, 2019
Exploiting relationships between visual regions and question words has achieved great success in learning multi-modality features for Visual Question Answering (VQA). However, we argue that existing methods mostly model relations between individual visual regions and words, which are not enough to correctly answer the question. From a human perspective, answering a visual question requires understanding the summarizations of visual and language information. In this paper, we propose the Multi-modality Latent Interaction module (MLI) to tackle this problem. The proposed module learns the cross-modality relationships between latent visual and language summarizations, which summarize visual regions and the question into a small number of latent representations to avoid modeling uninformative individual region-word relations. The cross-modality information between the latent summarizations is propagated to fuse valuable information from both modalities and is used to update the visual and word features. Such MLI modules can be stacked for several stages to model complex and latent relations between the two modalities, and they achieve highly competitive performance on the public VQA benchmarks VQA v2.0 and TDIUC. In addition, we show that the performance of our method can be significantly improved by combining it with the pre-trained language model BERT.
- Research Article
15
- 10.1145/3313873
- Apr 30, 2019
- ACM Transactions on Multimedia Computing, Communications, and Applications
Image captioning and visual question answering are typical tasks that connect computer vision and natural language processing. Both of them need to effectively represent the visual content using computer vision methods and smoothly process the text sentence using natural language processing skills. The key problem of these two tasks is to infer the target result based on the interactive understanding of the word sequence and the image. Though they practically use similar algorithms, they have been studied independently in the past few years. In this article, we attempt to exploit the mutual correlation between these two tasks. We propose the first VQA-improved image-captioning method that transfers the knowledge learned from the VQA corpora to the image-captioning task. A VQA model is first pretrained on image-question-answer instances. Then, the pretrained VQA model is used to extract VQA-grounded semantic representations according to selected free-form open-ended visual question-answer pairs. The VQA-grounded features are complementary to the visual features, because they interpret images from a different perspective. We incorporate the VQA model into the image-captioning model by adaptively fusing the VQA-grounded feature and the attended visual feature. We show that such simple VQA-improved image-captioning (VQA-IIC) models perform better than conventional image-captioning methods on large-scale public datasets.
- Conference Article
110
- 10.1109/cvpr.2018.00603
- Jun 1, 2018
Human conversation is a complex mechanism with subtle nuances. It is hence an ambitious goal to develop artificial intelligence agents that can participate fluently in a conversation. While we are still far from achieving this goal, recent progress in visual question answering, image captioning, and visual question generation shows that dialog systems may be realizable in the not too distant future. To this end, a novel dataset was introduced recently and encouraging results were demonstrated, particularly for question answering. In this paper, we demonstrate a simple symmetric discriminative baseline, that can be applied to both predicting an answer as well as predicting a question. We show that this method performs on par with the state of the art, even memory net based methods. In addition, for the first time on the visual dialog dataset, we assess the performance of a system asking questions, and demonstrate how visual dialog can be generated from discriminative question generation and question answering.