Abstract

Image captioning is the well-known task of generating a textual description of a given image. Work on this problem requires effort in both the computer vision and natural language processing domains to obtain higher-quality image descriptions. In this paper, we propose a new deep learning approach to generating image captions. We first generate a sequence of visual embeddings for the objects present in the image and the relationships between them. These visual embeddings are arranged in a particular order and supplied to the encoder of an attention-based sequence-to-sequence model; the decoder of the model then produces the caption. We evaluated the approach on the MSCOCO dataset, and the results suggest that our model generates better image captions on the MSCOCO test split.
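The abstract describes a pipeline in which precomputed visual embeddings are fed, as an ordered sequence, to an attention-based sequence-to-sequence model. As a rough illustration only, the following is a minimal PyTorch sketch of such an encoder-decoder with dot-product attention. All class names, dimensions, and the specific attention form are assumptions, since the full paper is not available here; this is not the authors' implementation.

import torch
import torch.nn as nn

class CaptionSeq2Seq(nn.Module):
    # Hypothetical sketch: an encoder over visual embeddings and a
    # decoder with dot-product attention over the encoder states.
    def __init__(self, embed_dim=512, hidden_dim=512, vocab_size=10000):
        super().__init__()
        # Encoder: consumes the ordered sequence of visual embeddings
        # (one per detected object / relationship).
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        # Decoder: generates caption tokens, attending over encoder states.
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.GRU(embed_dim + hidden_dim, hidden_dim, batch_first=True)
        self.attn = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, visual_seq, captions):
        # visual_seq: (B, N, embed_dim) ordered visual embeddings
        # captions:   (B, T) caption token ids (teacher forcing)
        enc_states, hidden = self.encoder(visual_seq)            # (B, N, H)
        tokens = self.word_embed(captions)                       # (B, T, E)
        logits = []
        for t in range(tokens.size(1)):
            # Dot-product attention of the decoder state over encoder states.
            query = self.attn(hidden[-1]).unsqueeze(1)           # (B, 1, H)
            scores = torch.bmm(query, enc_states.transpose(1, 2))      # (B, 1, N)
            context = torch.bmm(scores.softmax(dim=-1), enc_states)    # (B, 1, H)
            step_in = torch.cat([tokens[:, t:t+1], context], dim=-1)   # (B, 1, E+H)
            dec_out, hidden = self.decoder(step_in, hidden)
            logits.append(self.out(dec_out))
        return torch.cat(logits, dim=1)                          # (B, T, vocab)

In this sketch the visual embeddings (e.g., detector features for objects and their pairwise relationships) are assumed to be precomputed and already ordered; the abstract does not specify the ordering scheme or the exact attention mechanism used.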
