ET: Explain to Train: Leveraging Explanations to Enhance the Training of A Multimodal Transformer

Abstract


Similar Papers
  • Conference Article
  • Citations: 149
  • 10.1109/cvpr52688.2022.00493
End-to-End Referring Video Object Segmentation with Multimodal Transformers
  • Jun 1, 2022
  • Adam Botach + 2 more

The referring video object segmentation task (RVOS) involves segmentation of a text-referred object instance in the frames of a given video. Due to the complex nature of this multimodal task, which combines text reasoning, video understanding, instance segmentation and tracking, existing approaches typically rely on sophisticated pipelines in order to tackle it. In this paper, we propose a simple Transformer-based approach to RVOS. Our framework, termed Multimodal Tracking Transformer (MTTR), models the RVOS task as a sequence prediction problem. Following recent advancements in computer vision and natural language processing, MTTR is based on the realization that video and text can be processed together effectively and elegantly by a single multimodal Transformer model. MTTR is end-to-end trainable, free of text-related inductive bias components and requires no additional mask-refinement post-processing steps. As such, it simplifies the RVOS pipeline considerably compared to existing methods. Evaluation on standard benchmarks reveals that MTTR significantly outperforms previous art across multiple metrics. In particular, MTTR shows impressive +5.7 and +5.0 mAP gains on the A2D-Sentences and JHMDB-Sentences datasets respectively, while processing 76 frames per second. In addition, we report strong results on the public validation set of Refer-YouTube-VOS, a more challenging RVOS dataset that has yet to receive the attention of researchers. The code to reproduce our experiments is available at https://github.com/mttr2021/MTTR.
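
As a rough illustration of the core idea in this abstract — processing video and text jointly with a single multimodal Transformer — the sketch below concatenates frame tokens and word tokens into one sequence and encodes them together. PyTorch, the class name, the type embeddings, and all dimensions are illustrative assumptions, not MTTR's actual implementation.

```python
import torch
import torch.nn as nn

class JointVideoTextEncoder(nn.Module):
    """Toy multimodal encoder: concatenates per-frame visual tokens and text
    tokens into one sequence and runs a shared Transformer over it
    (illustrative only; not the MTTR architecture)."""
    def __init__(self, d_model=256, nhead=8, num_layers=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.type_embed = nn.Embedding(2, d_model)  # 0 = video token, 1 = text token

    def forward(self, video_tokens, text_tokens):
        # video_tokens: (B, T*HW, d_model); text_tokens: (B, L, d_model)
        v = video_tokens + self.type_embed.weight[0]
        t = text_tokens + self.type_embed.weight[1]
        joint = torch.cat([v, t], dim=1)   # one multimodal sequence
        return self.encoder(joint)         # jointly attended features

# usage: per-frame object queries would be decoded from these joint features
enc = JointVideoTextEncoder()
out = enc(torch.randn(2, 100, 256), torch.randn(2, 12, 256))
print(out.shape)  # torch.Size([2, 112, 256])
```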

  • Research Article
  • 10.37675/jat.2025.00759
Explainable Crop Classification Using a BERT-Based Bidirectional Attention Multimodal Transformer
  • Dec 30, 2025
  • Academic Society for Appropriate Technology
  • Myeonghoon Kim + 3 more

Accelerating climate change and the intensifying global food security crisis have increased the importance of reliable crop classification across diverse environmental conditions. Existing crop classification models have primarily focused on improving accuracy by learning spectral and temporal patterns from satellite imagery; however, their black-box nature makes it difficult to understand the rationale behind each prediction, limiting their applicability in real-world agricultural decision-making. To address this issue, this study introduces a multimodal Transformer model that incorporates a BERT-based bidirectional attention mechanism, aiming to retain classification performance while enhancing interpretability. The proposed BERT Hybrid model employs a PVT backbone to extract spatial features from Sentinel-2 satellite imagery and integrates them with meteorological time-series embeddings; bidirectional self-attention is then used to jointly model cross-temporal and cross-modal interactions. We further conduct comparative experiments under the same conditions as the MMST-ViT (Multi-Modal Spatial-Temporal Vision Transformer) baseline, evaluating not only overall accuracy but also temporal attention patterns across crop growth stages and the relative importance of different weather variables. Experimental results show that bidirectional attention alleviates excessive focus on specific timestamps or single variables, producing more consistent and interpretable attention distributions. This study highlights the performance–interpretability trade-off in multimodal agricultural AI models and provides a foundation for building trustworthy deep-learning systems for crop monitoring. In addition, because the proposed approach relies solely on globally accessible Sentinel-2 satellite imagery and publicly available meteorological data, it demonstrates the potential for constructing large-scale crop monitoring systems at low cost, aligning with the principles of appropriate technology.

  • Research Article
  • Citations: 21
  • 10.1016/j.isci.2023.108320
Multimode microdimer robot for crossing tissue morphological barrier
  • Oct 28, 2023
  • iScience
  • Haocheng Wang + 10 more

  • Research Article
  • Citations: 459
  • 10.1109/tcsvt.2019.2947482
Multimodal Transformer With Multi-View Visual Representation for Image Captioning
  • Oct 25, 2019
  • IEEE Transactions on Circuits and Systems for Video Technology
  • Jun Yu + 3 more

Image captioning aims to automatically generate a natural language description of a given image, and most state-of-the-art models have adopted an encoder-decoder framework. The framework consists of a convolution neural network (CNN)-based image encoder that extracts region-based visual features from the input image, and a recurrent neural network (RNN)-based caption decoder that generates the output caption words based on the visual features with the attention mechanism. Despite the success of existing studies, current methods only model the co-attention that characterizes the inter-modal interactions while neglecting the self-attention that characterizes the intra-modal interactions. Inspired by the success of the Transformer model in machine translation, here we extend it to a Multimodal Transformer (MT) model for image captioning. Compared to existing image captioning approaches, the MT model simultaneously captures intra- and inter-modal interactions in a unified attention block. Due to the in-depth modular composition of such attention blocks, the MT model can perform complex multimodal reasoning and output accurate captions. Moreover, to further improve the image captioning performance, multi-view visual features are seamlessly introduced into the MT model. We quantitatively and qualitatively evaluate our approach using the benchmark MSCOCO image captioning dataset and conduct extensive ablation studies to investigate the reasons behind its effectiveness. The experimental results show that our method significantly outperforms the previous state-of-the-art methods. With an ensemble of seven models, our solution ranks 1st on the real-time leaderboard of the MSCOCO image captioning challenge at the time of writing.
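
A minimal sketch of the unified attention block described above: one self-attention pass over the concatenated [regions; words] sequence computes intra-modal (region–region, word–word) and inter-modal (region–word) interactions at once. The dimensions and token counts are assumptions for illustration, not the paper's settings.

```python
import torch
import torch.nn as nn

# Unified attention over a joint [regions; words] sequence: every token
# attends to every other token, covering both intra- and inter-modal
# interactions in a single block.
d_model, nhead = 512, 8
attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)

regions = torch.randn(1, 36, d_model)   # e.g. 36 detected region features
words   = torch.randn(1, 15, d_model)   # partial caption embeddings
tokens  = torch.cat([regions, words], dim=1)

fused, weights = attn(tokens, tokens, tokens)   # self-attention on the joint sequence
print(fused.shape, weights.shape)  # (1, 51, 512) (1, 51, 51)
```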

  • Conference Article
  • Citations: 2
  • 10.23919/iccas55662.2022.10003914
Multi-modal Transformer for Indoor Human Action Recognition
  • Nov 27, 2022
  • Jeonghyeok Do + 1 more

Indoor human action recognition is used in various fields. For example, we can use it to recognize exercise movements in the fitness industry, which can significantly help improve the health of modern people. With the development of sensors, it has become possible to easily acquire multiple data modalities of RGB, IR, depth, and skeleton in the same scene. Since each data modality is complementary, proper fusion is beneficial in recognizing human action. However, existing studies have limitations in utilizing the advantages of each modality. Therefore, we propose a Multi-Modal Transformer (MMT) to use RGB and skeleton data simultaneously in this work. Using the transformer-based structure, MMT can capture the correlation between non-local joints in skeleton data modality. In addition, MMT does not require additional training phases or multiple trained networks as the number of people on the scene changes. In experiments on public benchmark datasets, MMT shows comparable results using only eight input frames.

  • Research Article
  • 10.1158/1557-3265.sabcs24-ps11-08
Abstract PS11-08: MRI improves multi-modal AI system for breast cancer diagnosis and prognosis
  • Jun 13, 2025
  • Clinical Cancer Research
  • Yanqi Xu + 8 more

Background: MRI is the most sensitive imaging modality for breast cancer detection and is not affected by breast density. Screening MRI has higher specificity than mammography in high-risk populations, including women with a family history of breast cancer, BRCA1/2 mutations, and a personal history of breast cancer. The ACS screening guidelines recommend MRI supplemented with mammography for women at high risk (≥ 20%-25% lifetime risk). MRI is also used for diagnosing breast cancer when mammography and ultrasound are inconclusive. We investigate how MRI can improve cancer detection and risk prediction with a multi-modal AI system. Current standard-of-care risk models, such as the TC model, rely solely on clinical variables and do not account for the rich information in imaging data. Other existing AI systems typically analyze a single imaging modality, usually mammography. Our multi-modal transformer (MMT) learns from longitudinal imaging data of multiple modalities, FFDM, DBT, US and MRI. Methods: We utilized the NYU Multimodal Breast Cancer Dataset, comprising 1,372,455 exams from 298,670 patients (age 30-108, mean 56.55 years, SD 12.00 years) between 2010 and 2022, for MMT training and evaluation. Our objective is to predict whether a patient currently has cancer and, if not, assess the risk of developing cancer in the future, incorporating data from all available, present and prior, breast imaging. Our method involves three steps: (1) training modality-specific feature extractors separately to generate image-level and patch-level feature embeddings; (2) combining image embeddings with additional variables including age, modality, study date and view; (3) feeding the combined embeddings into a transformer for cancer prediction. The model outputs two predictions, the patient's probability of having cancer and the patient's risk of getting cancer within 5 years. Results: We evaluated our model on a subgroup of patients who had at least one MRI in their records. The MMT model achieved an AUROC of 0.943 (95% CI: 0.935, 0.950) for cancer detection and 0.796 (95% CI: 0.765, 0.826) for 5-year risk prediction across all modalities. We separately compared our model's AUROC on non-MRI exams and MRI exams with the corresponding baselines. For non-MRI exams, the MMT model with MRI data achieved an AUROC of 0.939 (95% CI: 0.929, 0.948) for cancer detection and 0.778 (95% CI: 0.742, 0.810) for 5-year risk prediction, which improved the baseline MMT model without MRI by 0.024 and 0.044 (two-sided DeLong's test, P < 0.01 for both) respectively. These results demonstrate that incorporating MRI improves both cancer detection and risk prediction for non-MRI exams. For MRI exams, the MMT model achieved an AUROC of 0.947 (95% CI: 0.934, 0.958) for cancer detection, improving by 0.029 (two-sided DeLong's test, P < 0.01) compared to an MRI-only baseline. This indicates that including prior imaging enhances the effectiveness of MRI in detecting cancer. However, for risk prediction on MRI exams, there was no significant improvement (ΔAUROC 0.004; two-sided DeLong's test, P = 0.94). Additionally, MMT's risk prediction AUROC on MRI exams was lower than for other modalities (0.719, 95% CI: 0.615, 0.813), suggesting that MRI alone has less predictive power for future risk. Citation Format: Yanqi Xu, Jungkyu Park, Yiqiu Shen, Frank Yeung, Joe Cappadona, Jan Witowski, Linda Pak, Freya Schnabel, Krzysztof J. Geras. MRI improves multi-modal AI system for breast cancer diagnosis and prognosis [abstract]. 
In: Proceedings of the San Antonio Breast Cancer Symposium 2024; 2024 Dec 10-13; San Antonio, TX. Philadelphia (PA): AACR; Clin Cancer Res 2025;31(12 Suppl):Abstract nr PS11-08.
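
The three-step pipeline described in this abstract (modality-specific feature extractors, image embeddings combined with exam metadata, a transformer with two prediction heads) can be sketched roughly as below. All class names, layer sizes, the pooling choice, and the metadata encoding are hypothetical, not the authors' model.

```python
import torch
import torch.nn as nn

class ExamSequenceClassifier(nn.Module):
    """Illustrative sketch: pre-extracted image embeddings are combined with
    exam metadata, run through a Transformer over the imaging history, and
    pooled into two heads (current cancer, 5-year risk)."""
    def __init__(self, img_dim=512, meta_dim=8, d_model=256):
        super().__init__()
        self.proj = nn.Linear(img_dim + meta_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.detect_head = nn.Linear(d_model, 1)   # P(cancer now)
        self.risk_head = nn.Linear(d_model, 1)     # P(cancer within 5 years)

    def forward(self, img_emb, meta):
        # img_emb: (B, N_images, img_dim); meta: (B, N_images, meta_dim)
        x = self.proj(torch.cat([img_emb, meta], dim=-1))
        h = self.encoder(x).mean(dim=1)            # pool over the exam history
        return torch.sigmoid(self.detect_head(h)), torch.sigmoid(self.risk_head(h))

model = ExamSequenceClassifier()
p_now, p_5yr = model(torch.randn(2, 10, 512), torch.randn(2, 10, 8))
```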

  • Research Article
  • Citations: 40
  • 10.1109/tpami.2023.3328185
Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding.
  • Feb 1, 2024
  • IEEE Transactions on Pattern Analysis and Machine Intelligence
  • Fengyuan Shi + 3 more

Multimodal transformer exhibits high capacity and flexibility to align image and text for visual grounding. However, the existing encoder-only grounding framework (e.g., TransVG) suffers from heavy computation due to the self-attention operation with quadratic time complexity. To address this issue, we present a new multimodal transformer architecture, coined Dynamic Multimodal DETR (Dynamic MDETR), by decoupling the whole grounding process into encoding and decoding phases. The key observation is that there exists high spatial redundancy in images. Thus, we devise a new dynamic multimodal transformer decoder by exploiting this sparsity prior to speed up the visual grounding process. Specifically, our dynamic decoder is composed of a 2D adaptive sampling module and a text-guided decoding module. The sampling module aims to select the informative patches by predicting the offsets with respect to a reference point, while the decoding module works for extracting the grounded object information by performing cross attention between image features and text features. These two modules are stacked alternately to gradually bridge the modality gap and iteratively refine the reference point of the grounded object, eventually realizing the objective of visual grounding. Extensive experiments on five benchmarks demonstrate that our proposed Dynamic MDETR achieves competitive trade-offs between computation and accuracy. Notably, using only 9% of feature points in the decoder, we can reduce the GFLOPs of the multimodal transformer by ∼44%, but still get higher accuracy than the encoder-only counterpart. With the same number of encoder layers as TransVG, our Dynamic MDETR (ResNet-50) outperforms TransVG (ResNet-101) but only brings marginal extra computational cost relative to TransVG. In addition, to verify its generalization ability and scale up our Dynamic MDETR, we build the first one-stage CLIP-empowered visual grounding framework, and achieve state-of-the-art performance on these benchmarks.
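
A rough sketch of the "sample a few informative points, then cross-attend with the text" idea described above. The offset predictor, the text-conditioned query, and all tensor shapes are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, C, H, W, K, L = 2, 256, 20, 20, 36, 12
feat = torch.randn(B, C, H, W)              # image feature map
text = torch.randn(B, L, C)                 # encoded text tokens
ref = torch.zeros(B, 1, 2)                  # reference point in [-1, 1] coords

offset_pred = nn.Linear(C, K * 2)           # predicts K 2-D offsets from a query
query = text.mean(dim=1)                    # crude text-conditioned query
offsets = offset_pred(query).view(B, K, 2).tanh()
points = (ref + offsets).clamp(-1, 1)       # sampling locations

# gather features at the K sampled locations (grid_sample grid is (B, H_out, W_out, 2))
sampled = F.grid_sample(feat, points.unsqueeze(2), align_corners=False)
sampled = sampled.squeeze(-1).permute(0, 2, 1)      # (B, K, C)

cross_attn = nn.MultiheadAttention(C, 8, batch_first=True)
decoded, _ = cross_attn(sampled, text, text)        # text-guided decoding
print(decoded.shape)  # (B, K, C)
```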

  • Research Article
  • 10.1177/14727978251374335
RETRACTED: Intelligent digital art design fusion platform based on multimodal transformation
  • Sep 3, 2025
  • Journal of Computational Methods in Sciences and Engineering
  • Xiao Li

With the continuous advancement of artificial intelligence technology, intelligent digital art design gradually integrates multi-modal data, such as images, text, and audio, in the creative process, improving the creativity and efficiency of design. Traditional art design platforms have problems such as insufficient information fusion and low creative efficiency when dealing with multi-modal data. In order to solve these challenges, this paper proposes an intelligent digital art design fusion platform based on multi-modal Transformer, which effectively fuses data of different modalities through the multi-modal Transformer architecture to improve creative efficiency and work quality. The proposed multi-modal Transformer framework is a novel approach to digital art creation, overcoming traditional limitations of single-modal platforms by integrating image, text, and audio. This multi-modal fusion significantly enhances creative efficiency by 33.3% and creativity by improving both the diversity and expressiveness of the generated artworks, primarily within a unified image resolution framework. However, the current platform is optimized for fixed resolution image generation. Handling mixed resolutions or modified images (in pixel or grid formats) presents challenges, particularly in maintaining output integrity. This limitation is recognized and will be addressed in future work through adaptive resolution techniques. The innovation lies in effectively leveraging the self-attention mechanism to balance the computational load while enriching creative outputs, addressing both artistic and technological challenges. Specific data analysis shows that the average time of three-modal fusion creation design is 80 min, while that of single-modal creation is 120 min, which proves the significant advantages of multi-modal fusion in accelerating design creation. In addition, the platform has also achieved good results in the quality of creation, and the creative score has increased by about 25% compared with the traditional platform.

  • Research Article
  • Citations: 25
  • 10.1021/acsami.4c01207
Multimodal Transformer for Property Prediction in Polymers.
  • Mar 19, 2024
  • ACS Applied Materials & Interfaces
  • Seunghee Han + 5 more

In this work, we designed a multimodal transformer that combines both the Simplified Molecular Input Line Entry System (SMILES) and molecular graph representations to enhance the prediction of polymer properties. Three models with different embeddings (SMILES, SMILES + monomer, and SMILES + dimer) were employed to assess the performance of incorporating multimodal features into transformer architectures. Fine-tuning results across five properties (i.e., density, glass-transition temperature (Tg), melting temperature (Tm), volume resistivity, and conductivity) demonstrated that the multimodal transformer with both the SMILES and the dimer configuration as inputs outperformed the transformer using only SMILES across all five properties. Furthermore, our model facilitates in-depth analysis by examining attention scores, providing deeper insights into the relationship between the deep learning model and the polymer attributes. We believe that our work, shedding light on the potential of multimodal transformers in predicting polymer properties, paves a new direction for understanding and refining polymer properties.
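
A minimal sketch of combining a SMILES token sequence with a molecular-graph embedding before a transformer regressor, as the abstract describes at a high level. The graph embedding is assumed to be precomputed, and the vocabulary, layer sizes, and target property are illustrative, not the paper's setup.

```python
import torch
import torch.nn as nn

class SmilesGraphRegressor(nn.Module):
    """Illustrative fusion of SMILES token embeddings with a precomputed
    molecular-graph embedding for polymer property regression."""
    def __init__(self, vocab=64, d_model=128, graph_dim=64):
        super().__init__()
        self.tok = nn.Embedding(vocab, d_model)
        self.graph_proj = nn.Linear(graph_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, 1)              # e.g. predicted Tg

    def forward(self, smiles_ids, graph_emb):
        # smiles_ids: (B, L) token indices; graph_emb: (B, graph_dim)
        g = self.graph_proj(graph_emb).unsqueeze(1)        # one "graph token"
        x = torch.cat([g, self.tok(smiles_ids)], dim=1)    # prepend to SMILES tokens
        return self.head(self.encoder(x)[:, 0])            # read prediction off the graph token

model = SmilesGraphRegressor()
y = model(torch.randint(0, 64, (2, 40)), torch.randn(2, 64))
```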

  • Conference Article
  • Citations: 9
  • 10.1109/icassp43922.2022.9746660
Distributed Audio-Visual Parsing Based On Multimodal Transformer and Deep Joint Source Channel Coding
  • May 23, 2022
  • Penghong Wang + 3 more

Audio-visual parsing (AVP) is a newly emerged multimodal perception task, which detects and classifies audio-visual events in video. However, most existing AVP networks only use a simple attention mechanism to guide audio-visual multimodal events, and are implemented at a single end. This makes them unable to effectively capture the relationships between audio-visual events and unsuitable for network transmission scenarios. In this paper, we focus on these problems and propose a distributed audio-visual parsing network (DAVPNet) based on multimodal transformer and deep joint source channel coding (DJSCC). Multimodal transformers are used to enhance the attention calculation between audio-visual events, and DJSCC is used to apply DAVP tasks to network transmission scenarios. Finally, the Look, Listen, and Parse (LLP) dataset is used to test the algorithm performance, and the experimental results show that DAVPNet has superior parsing performance.

  • Research Article
  • Citations: 48
  • 10.1109/jbhi.2023.3286689
Multimodal Transformer of Incomplete MRI Data for Brain Tumor Segmentation.
  • Jan 1, 2024
  • IEEE Journal of Biomedical and Health Informatics
  • Hsienchih Ting + 1 more

Accurate segmentation of brain tumors plays an important role for clinical diagnosis and treatment. Multimodal magnetic resonance imaging (MRI) can provide rich and complementary information for accurate brain tumor segmentation. However, some modalities may be absent in clinical practice. It is still challenging to integrate the incomplete multimodal MRI data for accurate segmentation of brain tumors. In this paper, we propose a brain tumor segmentation method based on multimodal transformer network with incomplete multimodal MRI data. The network is based on U-Net architecture consisting of modality specific encoders, multimodal transformer and multimodal shared-weight decoder. First, a convolutional encoder is built to extract the specific features of each modality. Then, a multimodal transformer is proposed to model the correlations of multimodal features and learn the features of missing modalities. Finally, a multimodal shared-weight decoder is proposed to progressively aggregate the multimodal and multi-level features with spatial and channel self-attention modules for brain tumor segmentation. A missing-full complementary learning strategy is used to explore the latent correlation between the missing and full modalities for feature compensation. For evaluation, our method is tested on the multimodal MRI data from BraTS 2018, BraTS 2019 and BraTS 2020 datasets. The extensive results demonstrate that our method outperforms the state-of-the-art methods for brain tumor segmentation on most subsets of missing modalities.
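
One way to picture handling missing modalities with a multimodal transformer, as this abstract describes in general terms, is to give each modality its own token(s) and mask absent modalities out of attention. The sketch below uses a key-padding mask for this; it is an illustration under assumed shapes, not the paper's network.

```python
import torch
import torch.nn as nn

d_model, n_modalities = 128, 4          # e.g. T1, T1ce, T2, FLAIR
layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

modality_tokens = torch.randn(2, n_modalities, d_model)    # per-modality features
present = torch.tensor([[True, True, False, True],          # patient 1 missing T2
                        [True, False, False, True]])        # patient 2 missing T1ce, T2

# key_padding_mask marks positions to ignore, so absent modalities
# contribute nothing as keys during attention
fused = encoder(modality_tokens, src_key_padding_mask=~present)
print(fused.shape)  # torch.Size([2, 4, 128])
```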

  • Book Chapter
  • Citations: 2
  • 10.1007/978-981-19-8746-5_7
Investigation of Explainability Techniques for Multimodal Transformers
  • Jan 1, 2022
  • Krithik Ramesh + 1 more

Multimodal transformers such as CLIP and ViLBERT have become increasingly popular for visiolinguistic tasks as they have an efficient and generalizable understanding of visual features and labels. Notable examples of visiolinguistic models include OpenAI's CLIP by Radford et al. and ViLBERT by Lu et al. One of the gaps in current multimodal transformers is that there are no unified explainability frameworks to compare attention interactions meaningfully between models. To address the comparability concern, we investigate two different explainability frameworks. Specifically, Label Attribution and Optimal Transport of Vision-Language semantic spaces with the VisualBERT multimodal transformer model provide an interpretability process towards understanding attention interactions in multimodal transformers. We provide a case study of the Visual Genome and Question Answer 2 datasets trained using VisualBERT. Keywords: Multimodal transformers, Label attribution, Optimal transport
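
To make the optimal-transport framing concrete, the sketch below computes an entropic transport plan between a set of vision embeddings and a set of text embeddings with Sinkhorn iterations; the plan indicates which visual and textual tokens align. The cost choice, dimensions, and hyperparameters are assumptions for illustration, not the chapter's actual procedure.

```python
import torch

def sinkhorn(cost, eps=0.1, n_iters=50):
    """Entropic OT coupling between two uniform marginals (illustrative)."""
    K = torch.exp(-cost / eps)                   # (n, m) kernel
    u = torch.ones(cost.size(0)) / cost.size(0)  # uniform source marginal
    v = torch.ones(cost.size(1)) / cost.size(1)  # uniform target marginal
    a, b = u.clone(), v.clone()
    for _ in range(n_iters):
        a = u / (K @ b)
        b = v / (K.t() @ a)
    return a.unsqueeze(1) * K * b.unsqueeze(0)   # transport plan

vision = torch.randn(36, 64)                 # e.g. region embeddings
text = torch.randn(12, 64)                   # e.g. token embeddings
cost = torch.cdist(vision, text)             # pairwise distance as transport cost
plan = sinkhorn(cost)
print(plan.shape, plan.sum())                # (36, 12), total mass ~1
```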

  • Research Article
  • Citations: 5
  • 10.1016/j.comcom.2023.12.014
Network security situation assessment and prediction method based on multimodal transformation in edge computing
  • Dec 21, 2023
  • Computer Communications
  • Meng Xu + 2 more

  • Conference Article
  • Citations: 1
  • 10.1117/12.2643741
Multi-modal transformer for video retrieval using improved sentence embeddings
  • Oct 12, 2022
  • Zhi Liu + 2 more

With the explosive growth of the number of online videos, video retrieval becomes increasingly difficult. Multi-modal visual and language understanding based video-text retrieval is one of the mainstream frameworks to solve this problem. Among them, MMT (Multi-modal Transformer) is a novel and mainstream model. On the language side, BERT (Bidirectional Encoder Representation for Transformers) is used to encode text, where the pretrained BERT is fine-tuned during training. However, there exists a mismatch at this stage: the pre-training tasks of BERT are based on NSP (Next Sentence Prediction) and MLM (masked language model), which have only a weak correlation with video retrieval, whereas the text encoder needs to encode text into semantic embeddings. On the visual side, a Transformer is used to aggregate multimodal experts of videos. We find that the output of the visual Transformer is not fully utilized. In this paper, a Sentence-BERT model is introduced to substitute for the BERT model in MMT and improve sentence-embedding efficiency. In addition, a max-pooling layer is adopted after the Transformer to improve the utilization of the model's output. Experimental results show that the proposed model outperforms MMT.
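
The two modifications this abstract describes can be sketched in miniature: sentence embeddings from a Sentence-BERT model on the text side, and max-pooling over the visual transformer's output tokens before matching. The model name, feature dimensions, and similarity choice below are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer

text_encoder = SentenceTransformer("all-MiniLM-L6-v2")        # 384-dim sentence embeddings
query = text_encoder.encode(["a dog catches a frisbee"], convert_to_tensor=True)

layer = nn.TransformerEncoderLayer(d_model=384, nhead=8, batch_first=True)
visual_encoder = nn.TransformerEncoder(layer, num_layers=2)

expert_tokens = torch.randn(5, 7, 384)             # 5 videos x 7 "expert" features each
video_emb = visual_encoder(expert_tokens).max(dim=1).values   # max-pool over output tokens

scores = torch.nn.functional.cosine_similarity(query, video_emb)  # query-video similarity
print(scores.shape)  # torch.Size([5])
```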

  • Research Article
  • Citations: 1
  • 10.1016/j.sna.2024.115952
Design of multimodal transformable wheels for amphibious robotic vehicles
  • Oct 3, 2024
  • Sensors and Actuators: A. Physical
  • Zhangyuan Wang + 3 more
