Abstract

Recently, attention-based image captioning models, which are expected to ground the correct image regions for proper word generation, have achieved remarkable performance. However, some researchers have argued that existing attention mechanisms suffer from a "deviated focus" problem when determining the effective and influential image features. In this paper, we present A2, an attention-aligned Transformer for image captioning, which guides attention learning in a perturbation-based self-supervised manner, without any annotation overhead. Specifically, we apply a masking operation to image regions through a learnable network to estimate each region's true contribution to the final description generation. We hypothesize that the necessary image-region features, where a small perturbation causes an obvious performance degradation, deserve higher attention weights. We then propose four alignment strategies that use this information to refine the attention weight distribution. Under such a pattern, image regions are attended to in correct correspondence with the output words. Extensive experiments conducted on the MS COCO dataset demonstrate that the proposed A2 Transformer consistently outperforms baselines in both automatic metrics and human evaluation. Trained models and code for reproducing the experiments are publicly available.
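
The perturbation-and-alignment idea can be illustrated with a minimal sketch. Note that this is a simplified leave-one-out approximation of region importance, not the paper's learnable masking network, and the names `caption_loss_fn`, `perturbation_importance`, `align_attention`, and `alpha` are hypothetical; the blending step stands in for only one of several possible alignment strategies.

```python
import torch
import torch.nn.functional as F

def perturbation_importance(region_feats, caption_loss_fn):
    """Estimate per-region importance by masking each region in turn
    and measuring how much the captioning loss degrades.

    region_feats: (N, D) tensor of N image-region features.
    caption_loss_fn: callable mapping region features to a scalar loss
                     (a hypothetical stand-in for the captioning model).
    """
    with torch.no_grad():  # importance probing needs no gradients
        base_loss = caption_loss_fn(region_feats)
        importance = torch.zeros(region_feats.size(0))
        for i in range(region_feats.size(0)):
            masked = region_feats.clone()
            masked[i] = 0.0  # perturb: zero out region i
            # a larger loss increase means region i is more influential
            importance[i] = caption_loss_fn(masked) - base_loss
    return importance

def align_attention(attn_weights, importance, alpha=0.5):
    """One illustrative alignment strategy: blend the model's attention
    distribution with the normalized perturbation importance."""
    target = F.softmax(importance, dim=-1)
    return (1 - alpha) * attn_weights + alpha * target
```

In this sketch, regions whose removal raises the loss receive a larger share of the refined attention distribution, matching the hypothesis that perturbation-sensitive regions deserve more attention weight.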
