Abstract

• This paper presents an attention-based method for image captioning that uses contrastive learning. The proposed model uses a transformer-based architecture with an image encoder section and a text decoder section. The image encoder uses a feature attention block and a residual-like structure to focus on particular features.
• The loss function for the proposed method is a combination of cross-entropy loss and contrastive learning loss, which makes the generated captions more diverse.
• The proposed method is shown to achieve state-of-the-art performance on well-known datasets.

Image captioning is a significant step toward achieving automatic interaction between humans and computers, in which a textual description of the content of an image is generated. Recently, the transformer-based encoder–decoder paradigm has made great achievements in image captioning. Such models are usually trained with a cross-entropy loss function. However, for different captions of the same image that convey the same meaning, the computed losses may differ. As a result, the generated descriptions tend to be uniform, which limits the diversity of image captioning. In this paper, we present the attention-reinforced transformer (ArCo), a transformer-based architecture for image captioning. The architecture improves the image encoding stage by exploiting the relationships between image regions through an integrated feature attention block (FAB). During the training phase, we trained the model with a combination of cross-entropy loss and contrastive loss. We experimentally compared the performance of ArCo with that of other fully attentive models. We also validated the transformer baseline for image captioning with different pre-trained models. Our proposed approach achieves a new state-of-the-art performance on the offline ‘Karpathy’ test split and the online test server.
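To make the combined training objective concrete, the following is a minimal sketch of a cross-entropy plus contrastive loss, assuming a PyTorch-style setup. The symmetric InfoNCE-style contrastive term, the weighting factor lambda_c, the temperature, and the tensor shapes are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def combined_caption_loss(logits, targets, img_emb, txt_emb,
                          lambda_c=0.5, temperature=0.07):
    """Hypothetical combined objective.
    logits:  (B, T, V) decoder token scores
    targets: (B, T)    ground-truth token ids (0 = padding)
    img_emb, txt_emb: (B, D) pooled image / caption embeddings
    """
    # Standard token-level cross-entropy over the generated caption.
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                         targets.reshape(-1), ignore_index=0)

    # Symmetric InfoNCE-style contrastive term (an assumption):
    # matched image-caption pairs are positives, all other pairs
    # in the batch act as negatives.
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    sim = img @ txt.t() / temperature            # (B, B) similarity matrix
    labels = torch.arange(sim.size(0), device=sim.device)
    contrastive = 0.5 * (F.cross_entropy(sim, labels) +
                         F.cross_entropy(sim.t(), labels))

    return ce + lambda_c * contrastive
```

In this sketch, the contrastive term rewards captions whose embeddings stay close to their own image and away from other images in the batch, which is one way a combined objective can encourage more diverse descriptions than cross-entropy alone.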
