Transformer-based image captioning by leveraging sentence information

Vahid Chahkandi,Mohammad Javad Fadaeieslam,Farzin Yaghmaee

doi:10.1117/1.jei.31.4.043005

Abstract

Although the autoregressive image captioning methods yield good-quality image descriptions, their sequential structures slow down the speed of sentence generation processes. With a view to overcome these shortcomings, some nonautoregressive models have been proposed, but the quality of sentences produced by them is lower than those obtained in autoregressive methods. We have designed a new structure based on nonautoregressive methods to not only find better relations between sentence words and image salient objects but also combine this information with some positional information, extracted from the sentence, to generate a more qualified target sentence. The experimental results on the standard benchmark show that our proposed model achieves performance better than general nonautoregressive captioning models.

Full Text