Abstract

Neural networks have proven very successful at automatically capturing the composition of language and other structures across a range of multi-modal tasks. An important question, therefore, is how neural networks learn and organise such structures. Numerous studies have examined the knowledge captured by language models (LSTMs, transformers) and vision architectures (CNNs, vision transformers) on their respective uni-modal tasks. However, very few have explored what structures are acquired by multi-modal transformers, where linguistic and visual features are combined. It is critical to understand the representations learned by each modality, their interplay, and the effect of the task on these representations in large-scale architectures. In this paper, we take a multi-modal transformer trained for image captioning and examine the structure of the self-attention patterns extracted from its visual stream. Our results indicate that information about relations between objects in the visual stream is organised hierarchically, ranging from a local to a global, object-level understanding of the image. In particular, while visual representations in the first layers encode relations between semantically similar object detections, which are often neighbouring objects, deeper layers spread their attention across more distant objects and learn global relations between them. We also show that the objects attended globally in deeper layers can be linked to entities described in the image captions, indicating a critical finding: the indirect effect of language on visual representations. In addition, we highlight how object-based input representations shape the structure of the learned visual knowledge and guide the model towards more accurate image descriptions. A parallel question we investigate is whether insights from cognitive science echo the structure of the representations that the current neural architecture learns. The proposed analysis of the inner workings of multi-modal transformers can be used to better understand and improve tasks such as pre-training of large-scale multi-modal architectures, multi-modal information fusion and probing of attention weights. Overall, we contribute to explainable multi-modal natural language processing and to the currently shallow understanding of how input representations and the structure of the multi-modal transformer affect visual representations.
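To make the kind of layer-wise attention analysis described above concrete, the sketch below quantifies how "local" or "global" the visual self-attention is at each layer by computing the attention-weighted distance between detected object regions. This is a minimal, hypothetical illustration and not the authors' code: it assumes access to the captioning model's per-layer visual self-attention tensors and to normalised bounding-box centres for the detected objects, and it uses random tensors as stand-ins for real model outputs.

```python
# Minimal sketch (not the authors' code): layer-wise analysis of visual
# self-attention over detected object regions. We assume the captioning
# model exposes attention tensors of shape (layers, heads, objects, objects)
# for the visual stream, and that each object has a bounding-box centre
# normalised to [0, 1]^2.
import torch

def mean_attention_distance(attn, centres):
    """Average spatial distance between query and attended objects, per layer.

    attn:    (L, H, N, N) attention weights, each row summing to 1
    centres: (N, 2) normalised bounding-box centres
    """
    # Pairwise Euclidean distances between object centres: (N, N)
    dist = torch.cdist(centres, centres)
    # Attention-weighted distance, averaged over heads and query objects
    return (attn * dist).sum(-1).mean(dim=(1, 2))  # shape: (L,)

# Toy example with random values in place of a real model's outputs
L, H, N = 6, 8, 36                        # layers, heads, object detections
attn = torch.rand(L, H, N, N).softmax(-1)
centres = torch.rand(N, 2)
per_layer = mean_attention_distance(attn, centres)
print(per_layer)  # larger values in deeper layers would indicate more global attention
```

In this setup, a rising per-layer value would correspond to the local-to-global pattern reported in the paper, with deeper layers attending to more distant objects.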

Highlights

  • The ability of transformers to capture contextualised representations and encode long-term relations has led to their successful application in various NLP tasks (Vaswani et al., 2017; Devlin et al., 2019; Radford et al., 2019)

  • We focus on the vision stream and inspect 1) how visual knowledge is represented in a transformer as exemplified by self-attention, 2) how visual knowledge is affected by the overall training task, which is image caption generation, and 3) whether the observed attentional patterns are intuitively interpretable to us

  • A large number of papers have focused on the analysis of representations captured by uni-modal architectures, e.g. BERT (Devlin et al., 2019)

Introduction

The ability of transformers to capture contextualised representations and encode long-term relations has led to their successful application in various NLP tasks (Vaswani et al., 2017; Devlin et al., 2019; Radford et al., 2019). Their large size, layer depth and numerous multi-head self-attention mechanisms are the main reasons for their excellent performance. Multiple explainability methods and tools have been proposed in the 'BERTology' field, which investigates whether transformers can learn helpful information. In these approaches, self-attention is typically inspected for the presence of specific linguistic knowledge as a product of cognition. Vig and Belinkov (2019) show that more complex linguistic phenomena are captured in deeper attention heads of the model, building on top of much simpler knowledge present in earlier layers. It has also recently been shown that gradient-based explainability methods are relatively easy to manipulate and corrupt (Wang et al., 2020).
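To illustrate the 'BERTology'-style inspection mentioned above, the following sketch extracts per-layer, per-head self-attention weights from a pre-trained BERT model via the HuggingFace transformers library and prints, for one head, which token each token attends to most strongly. The model name, the choice of layer and head, and the example sentence are illustrative assumptions and are not taken from any cited study.

```python
# Minimal sketch of BERTology-style attention inspection (illustrative only):
# extract per-layer, per-head self-attention from BERT and examine one head.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

inputs = tokenizer("The dog that chased the cat was tired.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per layer,
# each of shape (batch, num_heads, seq_len, seq_len)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
layer, head = 8, 10                          # arbitrary layer/head for illustration
attn = outputs.attentions[layer][0, head]    # (seq_len, seq_len)

# For each token, print the token it attends to most strongly in this head
for i, tok in enumerate(tokens):
    j = attn[i].argmax().item()
    print(f"{tok:>12} -> {tokens[j]}")
```

Analyses of this kind underlie the observation that simpler patterns appear in earlier layers while more complex linguistic phenomena emerge in deeper attention heads.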
