Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers

Hila Chefer, Lior Wolf, Shir Gur

doi:10.1109/iccv48922.2021.00045

Abstract

Transformers increasingly dominate multi-modal reasoning tasks such as visual question answering, achieving state-of-the-art results thanks to their ability to contextualize information using self-attention and co-attention mechanisms. These attention modules also play a role in other computer vision tasks, including object detection and image segmentation. Unlike Transformers that use only self-attention, Transformers with co-attention require considering multiple attention maps in parallel in order to highlight the information in the model's input that is relevant to the prediction. In this work, we propose the first method to explain predictions by any Transformer-based architecture, including bi-modal Transformers and Transformers with co-attention. We provide generic solutions and apply them to the three most commonly used of these architectures: (i) pure self-attention, (ii) self-attention combined with co-attention, and (iii) encoder-decoder attention. We show that our method is superior to all existing methods, which are adapted from single-modality explainability. Our code is available at: https://github.com/hila-chefer/Transformer-MM-Explainability.

Similar Papers

Reliable Visual Question Answering: Abstain Rather Than Answer Incorrectly
Spencer Whitehead ... Marcus Rohrbach
01 Jan 2021

Advancing Accuracy in Multimodal Medical Tasks Through Bootstrapped Language-Image Pretraining (BioMedBLIP): Performance Evaluation Study.
Usman Naseem ... Anum Masood
JMIR Medical Informatics | VOL. 12
05 Aug 2024

Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
Yash Goyal ... Tejas Khot
International Journal of Computer Vision | VOL. 127
11 Sep 2018

MobiVQA
Qingqing Cao ... Nicholas D. Lane
Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies | VOL. 6
04 Jul 2022
