Abstract

Image captioning has become one of the most popular research problems in artificial intelligence. Although many studies have achieved excellent results, challenges remain: cross-modal feature alignment lacks explicit guidance, and model-generated sentences contain grammatical errors. In this paper, we propose a relationship-aligned and grammar-wise BERT model that integrates a relationship exploration module and a grammar enhancement module into a BERT-based model. Specifically, in the relationship exploration module, to exploit relationship tags as anchors that guide semantic alignment, we design a network that computes the cosine similarity between visual features and word vector information. The grammar enhancement module is constructed similarly to BERT, so our framework uses two BERT modules: the first is the main frame that generates captions, and the second is an auxiliary model that judges whether the syntax of a generated caption is correct. To validate the performance of the proposed model, we conduct extensive experiments on the MSCOCO, Flickr30k, and Flickr8k datasets. Experimental results show that our method outperforms state-of-the-art approaches.
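The relationship exploration step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it scores each candidate relationship tag embedding against a visual feature vector by cosine similarity and keeps the best-matching tags as alignment anchors. All names here (`visual_feat`, `tag_embeddings`, `select_anchor_tags`) are hypothetical, and the toy 4-dimensional embeddings stand in for real visual and word-vector features.

```python
# Hedged sketch of relationship-tag selection via cosine similarity.
# Names and dimensions are illustrative assumptions, not the paper's code.
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors (with a small epsilon for safety)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def select_anchor_tags(visual_feat: np.ndarray, tag_embeddings: dict, top_k: int = 2):
    """Rank relationship tags by similarity to the visual feature; keep top_k."""
    scores = {tag: cosine_similarity(visual_feat, vec)
              for tag, vec in tag_embeddings.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Toy example: three relationship-tag embeddings in 4 dimensions.
rng = np.random.default_rng(0)
tags = {t: rng.normal(size=4) for t in ["holding", "riding", "near"]}

# A visual feature close to the "riding" embedding should select it first.
visual_feat = tags["riding"] + 0.01 * rng.normal(size=4)
anchors = select_anchor_tags(visual_feat, tags)
print(anchors)
```

In the full model, the selected anchor tags would then guide cross-modal semantic alignment between image regions and caption words.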
