Abstract

Region features play an important role in image captioning. However, obtaining them requires pre-training an object detector on large object detection datasets, which can make end-to-end training impossible. Moreover, if the distribution of the object detection datasets differs substantially from that of the image captioning datasets, the object detector may fail to extract accurate region features. Both issues limit the applicability of region features. In this paper, we propose a clustering-based mask recovery method for image captioning. In the encoder, pseudo-region features are obtained by clustering the grid features, which are extracted with Swin Transformer. We then feed the grid features together with the pseudo-region features into the decoder and let the model dynamically learn the weights of the two kinds of features during decoding, so as to minimize the effect of clustering errors. Generating pseudo-region features by clustering makes the training process end-to-end and removes the need for additional object detection datasets to train an object detector. In addition, the Transformer decoder suffers from a misplacement problem during decoding: the positional information the model uses when generating a word differs from the positional information it uses when it later reasons with that word. This may negatively affect the model's position encoding. We therefore change the original decoding method to mask recovery. Furthermore, we propose a masked multi-head attention module with relative position in the decoder to integrate the information in the fused features and to reconstruct the relative positional relationships between words. We conduct experiments on the MSCOCO 2014 dataset. The experimental results show that our model obtains CIDEr scores of 144.3% (single model) and 147.0% (ensemble of 4 models) on the ‘Karpathy’ offline test split, and a CIDEr score of 143.2% (c40) on the official online test server.
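
To make the encoder idea concrete, the following is a minimal sketch of how pseudo-region features might be derived from grid features by clustering. The abstract does not say which clustering algorithm is used, so plain k-means is an assumption here, and the function name `kmeans_pseudo_regions`, the number of clusters, and the feature sizes are all hypothetical. Concatenating grid and pseudo-region features at the end only illustrates passing both to a decoder; the learned weighting described above is not modeled.

```python
import numpy as np

def kmeans_pseudo_regions(grid_feats, num_regions=4, num_iters=10, seed=0):
    """Cluster grid features (N, D) into pseudo-region features (K, D).

    Assumption: plain k-means stands in for the unspecified clustering step.
    """
    rng = np.random.default_rng(seed)
    n = grid_feats.shape[0]
    # Initialise centroids from randomly chosen grid cells.
    centroids = grid_feats[rng.choice(n, size=num_regions, replace=False)].copy()

    def assign_cells(cents):
        # Squared Euclidean distance from every grid cell to every centroid: (N, K).
        dists = ((grid_feats[:, None, :] - cents[None, :, :]) ** 2).sum(axis=-1)
        return dists.argmin(axis=1)

    for _ in range(num_iters):
        assign = assign_cells(centroids)
        for k in range(num_regions):
            members = grid_feats[assign == k]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[k] = members.mean(axis=0)

    # Final assignment consistent with the returned centroids.
    return centroids, assign_cells(centroids)

# Hypothetical sizes: a 7x7 grid of 16-dimensional features.
grid = np.random.default_rng(1).normal(size=(49, 16))
regions, assign = kmeans_pseudo_regions(grid, num_regions=4)
# Both feature sets are made available to the decoder (here simply concatenated).
fused = np.concatenate([grid, regions], axis=0)
```

Each pseudo-region feature is the mean of the grid cells assigned to it, so no object detector or detection dataset is involved in this sketch.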
