Captioning an image involves using a combination of vision and language models to describe the image in an expressive and concise sentence. Successful captioning requires extracting as much information as possible from the image, and one key piece of information is the topic to which the image belongs. State-of-the-art methods extract these topics with topic modeling applied only to the caption text, which ignores the image's semantic information. Concept modeling, by contrast, extracts concepts directly from the images in addition to considering the corresponding caption text. Applied to image captioning, concept modeling can more fully capture the image context and exploit it to produce more accurate descriptions. In this paper, novel concept-based image captioning models are proposed. The first model uses an LSTM decoder, while the second pairs concept modeling with a new multi-encoder transformer architecture. The proposed models were evaluated with standard metrics on the Microsoft COCO and Flickr30K datasets, where they outperformed related methods with reduced computational complexity.
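The core idea of a multi-encoder design with concept features can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's implementation: the feature dimensions, the two encoder streams, and the fusion rule (concatenating visual and concept tokens so a decoder can attend over both) are all illustrative assumptions.

```python
import numpy as np

# Hypothetical sketch of fusing two encoder streams: one carrying visual
# region features, the other carrying concept features extracted from the
# image. All shapes and the fusion strategy are illustrative assumptions.
rng = np.random.default_rng(0)

d_model = 8
visual_feats = rng.normal(size=(4, d_model))   # 4 visual region tokens
concept_feats = rng.normal(size=(2, d_model))  # 2 detected-concept tokens

# Concatenate the streams along the sequence axis so the decoder can
# attend jointly over visual regions and image concepts.
memory = np.concatenate([visual_feats, concept_feats], axis=0)

# Toy single-head attention from one decoder query over the fused memory.
query = rng.normal(size=(d_model,))
scores = memory @ query / np.sqrt(d_model)
weights = np.exp(scores - scores.max())
weights /= weights.sum()
context = weights @ memory  # attention-weighted summary of both streams

print(memory.shape)   # (6, 8): visual + concept tokens
print(context.shape)  # (8,)
```

In a full model, `memory` would feed the cross-attention of a transformer (or LSTM) decoder that generates the caption token by token.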