Abstract

Modeling internal relationships over the descriptive region features and grid features of image objects has contributed significantly to the development of image captioning, especially when combined with the Transformer architecture. However, when computing self-attention, most of these methods consider only intra-object relationships and ignore the connection between entities and the background. Moreover, the ways of exploring relational information inside the image can be extended further. In this paper, we introduce a novel Mixed Knowledge Relation Transformer (MKRT) that explores the relationships between objects from both the internal attribute relationship and the external object-verb-subject relationship. Furthermore, we embed important image background information into the relation module. In MKRT, the semantic relations obtained from external knowledge are incorporated into relation modeling through a novel Mixed Knowledge Relation Attention (MKRA). To validate the effectiveness of our model, we conduct extensive experiments on the popular MSCOCO dataset, achieving a 134.5 CIDEr score on the offline test split and a 133.5 CIDEr (c40) score on the official online testing server.
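The abstract does not give MKRA's exact formulation, but a common way to fold pairwise relation knowledge into Transformer self-attention is to add a relation bias to the attention logits before the softmax. The following is a minimal NumPy sketch under that assumption; the function name `relation_augmented_attention` and its parameters are illustrative, not taken from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def relation_augmented_attention(Q, K, V, relation_bias):
    """Scaled dot-product attention with an additive relation bias.

    Q, K, V:        (n, d) query/key/value features for n image regions.
    relation_bias:  (n, n) pairwise relation scores (e.g. derived from
                    external semantic knowledge) added to the logits.
    Returns the attended features (n, d) and the attention weights (n, n).
    """
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d) + relation_bias  # bias shifts attention
    weights = softmax(logits, axis=-1)             # rows sum to 1
    return weights @ V, weights

# Toy usage: 3 regions with 4-dim features and a zero relation bias,
# which reduces to plain scaled dot-product attention.
Q = np.ones((3, 4))
out, w = relation_augmented_attention(Q, Q, np.eye(3, 4), np.zeros((3, 3)))
```

With a zero bias the sketch degenerates to vanilla self-attention; a knowledge-derived bias simply reweights which regions (including background regions) each query attends to.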
