Linguistically-aware attention for reducing the semantic gap in vision-language tasks

Gouthaman Kv, Athira Nambiar, Kancheti Sai Srinivas, Anurag Mittal

doi:10.1016/j.patcog.2020.107812

Abstract

Attention models are widely used in vision-language (V-L) tasks to correlate visual and textual inputs. Humans perform this correlation with a strong linguistic understanding of the visual world. However, even the best-performing attention models in V-L tasks lack such high-level linguistic understanding, which creates a semantic gap between the modalities. In this paper, we propose an attention mechanism, Linguistically-aware Attention (LAT), that leverages object attributes obtained from generic object detectors along with pre-trained language models to reduce this semantic gap. LAT represents the visual and textual modalities in a common linguistically-rich space, thus providing linguistic awareness to the attention process. We apply LAT to three V-L tasks and demonstrate its effectiveness: Counting-VQA, VQA, and image captioning. In Counting-VQA, we propose a novel counting-specific VQA model to predict an intuitive count and achieve state-of-the-art results on five datasets. In VQA and captioning, we show the generic nature and effectiveness of LAT by adapting it into various baselines and consistently improving their performance.
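
The idea of attending in a shared linguistic space can be illustrated with a minimal sketch. This is not the authors' LAT implementation: it assumes each detected region is described only by its class label, that the question words and the labels are embedded with the same word-embedding table, and that attention weights come from cosine similarity followed by a softmax. The embedding values, the function name `label_space_attention`, and the region features are all hypothetical toy stand-ins for pre-trained embeddings and detector outputs.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def label_space_attention(question_vecs, label_vecs, region_feats):
    """Attend over regions by comparing question words with detector labels.

    question_vecs: (n_words, d)   word embeddings of the question
    label_vecs:    (n_regions, d) word embeddings of each region's class label
    region_feats:  (n_regions, k) visual features of the regions
    """
    # Pool the question into one vector and normalise it.
    q = question_vecs.mean(axis=0)
    q = q / np.linalg.norm(q)
    # Normalise each label embedding so the dot product is a cosine similarity.
    labels = label_vecs / np.linalg.norm(label_vecs, axis=1, keepdims=True)
    scores = labels @ q               # one similarity score per region
    weights = softmax(scores)         # attention distribution over regions
    attended = weights @ region_feats # weighted sum of visual features
    return weights, attended

# Hypothetical toy embeddings standing in for a pre-trained language model.
toy_embeddings = {
    "how":  np.array([0.1, 0.1, 0.1]),
    "many": np.array([0.1, 0.1, 0.1]),
    "dogs": np.array([0.9, 0.1, 0.0]),
    "dog":  np.array([1.0, 0.0, 0.0]),
    "car":  np.array([0.0, 1.0, 0.0]),
    "tree": np.array([0.0, 0.0, 1.0]),
}

question = ["how", "many", "dogs"]
detected_labels = ["dog", "car", "tree"]  # assumed detector output, one per region
region_feats = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # toy visual features

question_vecs = np.stack([toy_embeddings[w] for w in question])
label_vecs = np.stack([toy_embeddings[l] for l in detected_labels])

weights, attended = label_space_attention(question_vecs, label_vecs, region_feats)
print(dict(zip(detected_labels, weights.round(3))))
```

In this toy setup the region labelled "dog" receives the largest weight because its label embedding lies closest to the question word "dogs"; a real system would combine such linguistic cues with the visual features rather than rely on labels alone.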

Journal: Pattern Recognition
Publication Date: Jan 1, 2021
Citations: 10

Similar Papers

A Survey on Causal Inference in Image Captioning
Jungeun Kim ... Junyeong Kim
05 Feb 2023

Language Features Matter: Effective Language Representations for Vision-Language Tasks
Andrea Burns ... Bryan Plummer
01 Oct 2019

Towards Explainable Deep Learning for Image Captioning through Representation Space Perturbation
Sofiane Elguendouze ... Adel Hafiane
18 Jul 2022

Deconfounded Image Captioning: A Causal Retrospect
Xu Yang ... Hanwang Zhang
IEEE Transactions on Pattern Analysis and Machine Intelligence | VOL. 45
01 Jan 2021
