Abstract

Modeling internal relationships over the descriptive region features and grid features of image objects has contributed significantly to the development of image captioning, especially when combined with the Transformer architecture. However, when computing self-attention, most of these methods consider only intra-object relationships and ignore the connection between entities and the background. Moreover, the ways of exploring relational information inside the image can be extended further. In this paper, we introduce a novel Mixed Knowledge Relation Transformer (MKRT) that explores the relationships between objects from both the internal attribute relationship and the external object-verb-subject relationship. Furthermore, we embed important image background information into the relation module. In MKRT, the semantic relations obtained from external knowledge are incorporated into relation modeling through a novel Mixed Knowledge Relation Attention (MKRA). To validate the effectiveness of our model, we conduct extensive experiments on the popular MSCOCO dataset, achieving a 134.5 CIDEr score on the offline test split and a 133.5 CIDEr (c40) score on the official online testing server.
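The abstract does not give MKRA's exact formulation, but a common way to fold pairwise relation knowledge into Transformer self-attention is to add a relation bias to the attention logits before the softmax. The following is a minimal NumPy sketch under that assumption; the function name `relation_augmented_attention` and its parameters are illustrative, not taken from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def relation_augmented_attention(Q, K, V, relation_bias):
    """Scaled dot-product attention with an additive relation bias.

    Q, K, V:        (n, d) query/key/value features for n image regions.
    relation_bias:  (n, n) pairwise relation scores (e.g. derived from
                    external semantic knowledge) added to the logits.
    Returns the attended features (n, d) and the attention weights (n, n).
    """
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d) + relation_bias  # bias shifts attention
    weights = softmax(logits, axis=-1)             # rows sum to 1
    return weights @ V, weights

# Toy usage: 3 regions with 4-dim features and a zero relation bias,
# which reduces to plain scaled dot-product attention.
Q = np.ones((3, 4))
out, w = relation_augmented_attention(Q, Q, np.eye(3, 4), np.zeros((3, 3)))
```

With a zero bias the sketch degenerates to vanilla self-attention; a knowledge-derived bias simply reweights which regions (including background regions) each query attends to.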
