Abstract

Integrating outside knowledge for reasoning in visio-linguistic tasks such as visual question answering (VQA) is an open problem. Given that pretrained language models have been shown to contain world knowledge, we propose a unimodal (text-only) training and inference procedure based on automatic off-the-shelf captioning of images and pretrained language models. More specifically, we verbalize the image contents so that language models can better leverage their implicit knowledge to solve knowledge-intensive tasks. Focusing on a visual question answering task that requires external knowledge (OK-VQA), our contributions are: (i) a text-only model that outperforms pretrained multimodal (image-text) models with a comparable number of parameters; (ii) confirmation that our text-only method is especially effective for tasks requiring external knowledge, as it is less effective on a standard VQA task (VQA 2.0); and (iii) our method attains state-of-the-art results when the size of the language model is increased. We also significantly outperform current multimodal systems, even when they are augmented with external knowledge. Our qualitative analysis on OK-VQA reveals that automatic captions often fail to capture relevant information in the images, which seems to be offset by the better inference ability of the text-only language models. Our work opens up possibilities to further improve inference in visio-linguistic tasks.
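As a rough illustration of the caption-then-answer idea described above, the minimal sketch below verbalizes an image with an off-the-shelf captioning model and then prompts a text-only language model with the caption and the question. The specific checkpoints (BLIP for captioning, Flan-T5 for answering), the prompt format, and the `answer` helper are illustrative assumptions, not the exact setup used in the paper.

```python
# Minimal sketch: caption an image, then let a text-only LM answer the question.
# Model choices and prompt format are assumptions for illustration only.
from transformers import pipeline
from PIL import Image

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
reader = pipeline("text2text-generation", model="google/flan-t5-large")

def answer(image_path: str, question: str) -> str:
    # 1) Verbalize the image contents with an off-the-shelf captioning model.
    image = Image.open(image_path)
    caption = captioner(image)[0]["generated_text"]
    # 2) Let a text-only language model answer, using the caption as context
    #    so it can draw on its implicit world knowledge.
    prompt = f"Context: {caption}\nQuestion: {question}\nAnswer:"
    return reader(prompt, max_new_tokens=10)[0]["generated_text"].strip()

print(answer("example.jpg", "What country do these flags belong to?"))
```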
