Abstract

Image captioning has attracted considerable attention because it bridges computer vision and natural language processing. Recent work shows that transformer-based models with multi-head self-attention can explore intra-modal interactions to generate high-quality image captions. However, in these multi-head attention methods each attention head operates on its subspace independently, which ignores the association between attention heads and leaves the learning of intra-modal interactions incomplete. In this paper, we propose a Multi-head Association Attention Enhancement Network (MAENet) for image captioning, which leverages a novel Multi-head Association Attention Enhancement (MAE) block to complete intra-modal interaction learning. The proposed MAE block contains a Multi-head Association Attention (MAA) module and an Attention Enhancement (AE) module. The MAA calculates the contributive weight of each attention head and captures associated information from adjacent attention subspaces via learned associative parameters. The AE module follows the MAA and further enhances the association attention results through an additional spatial and channel-wise attention aggregation. It is worth noting that the MAE block is a plug-and-play module that can be cascaded with other multi-head attention mechanisms. Extensive experiments on MS COCO show that our model achieves highly competitive performance; in particular, the variant that cascades the MAE block with X-linear attention obtains the best reported SPICE score of 23.5% on the Karpathy test split. This demonstrates that the proposed model better captures interactive information and produces superior captions.
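To make the abstract's description of the MAE block concrete, the following is a minimal PyTorch sketch of the general idea: multi-head self-attention whose head outputs are mixed by learned head-to-head weights (standing in for the "contributive weights" and "associative parameters"), followed by a channel-wise and spatial gating stage (standing in for the AE module). All module names, shapes, and the exact form of the associative parameters are assumptions drawn only from the abstract, not the authors' implementation.

```python
# Hypothetical sketch of an MAE-style block; details beyond the abstract are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAssociationAttention(nn.Module):
    """Self-attention whose heads are mixed by learned weights, so each head
    can draw on information from other head subspaces (assumed design)."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # Learned associative parameters: a head-to-head mixing matrix,
        # initialized to the identity so training starts from standard MHA.
        self.head_assoc = nn.Parameter(torch.eye(n_heads))

    def forward(self, x):                      # x: (B, N, d_model)
        B, N, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        split = lambda t: t.view(B, N, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)  # each: (B, H, N, d_head)
        attn = F.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        heads = attn @ v                        # (B, H, N, d_head)
        # Mix heads with the learned association weights: each output head is
        # a weighted combination of all heads' results (contributive weights).
        mix = torch.softmax(self.head_assoc, dim=-1)          # (H, H)
        heads = torch.einsum('gh,bhnd->bgnd', mix, heads)
        return self.out(heads.transpose(1, 2).reshape(B, N, -1))

class AttentionEnhancement(nn.Module):
    """Channel-wise (squeeze-excite style) and spatial gating applied to the
    association-attention output, as a stand-in for the AE module."""
    def __init__(self, d_model=512, reduction=4):
        super().__init__()
        self.channel = nn.Sequential(
            nn.Linear(d_model, d_model // reduction), nn.ReLU(),
            nn.Linear(d_model // reduction, d_model), nn.Sigmoid())
        self.spatial = nn.Linear(d_model, 1)

    def forward(self, x):                      # x: (B, N, d_model)
        x = x * self.channel(x.mean(dim=1, keepdim=True))     # channel gate
        x = x * torch.sigmoid(self.spatial(x))                # spatial gate
        return x

class MAEBlock(nn.Module):
    """MAA followed by AE, with a residual connection and layer norm."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.maa = MultiHeadAssociationAttention(d_model, n_heads)
        self.ae = AttentionEnhancement(d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return self.norm(x + self.ae(self.maa(x)))

# Example usage on a batch of 36 region features of dimension 512.
block = MAEBlock(d_model=512, n_heads=8)
features = torch.randn(2, 36, 512)
print(block(features).shape)  # torch.Size([2, 36, 512])
```

Because the block keeps the input and output dimensions identical, it can be dropped into an existing transformer encoder layer in place of (or cascaded after) a standard multi-head attention module, which mirrors the plug-and-play claim in the abstract.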
