Abstract

Image captioning is the task of generating a textual description of an image and has wide applications across various domains. Research in this area is ongoing and attracting increasing attention because it merges the fields of computer vision and natural language processing. In the presented work, an image caption generation model combining a probabilistic and a neural framework is built following an encoder-decoder scheme. The image is fed to a deep learning classifier, a Convolutional Neural Network (CNN), and a Long Short-Term Memory (LSTM) model is used to generate the descriptive sentences. The captions produced by the proposed model are compared against those of conventional models through statistical analysis using standard and popular similarity metrics: cosine similarity, precision, and recall. The evaluation results of the predicted captions against the ground-truth captions are presented, and the applicability of these evaluation methods in the context of image captioning is discussed and justified.
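To make the encoder-decoder scheme concrete, the following is a minimal sketch in PyTorch of a CNN encoder feeding a fixed-length image feature into an LSTM decoder. The choice of a pretrained ResNet-50 backbone, the frozen weights, the layer sizes, and all names are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class EncoderCNN(nn.Module):
    """Encodes an image into a fixed-length feature vector."""
    def __init__(self, embed_size: int):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        # Drop the classification head; keep the convolutional feature extractor.
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        self.fc = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():  # assumption: the pretrained backbone is frozen
            features = self.backbone(images).flatten(1)
        return self.fc(features)

class DecoderLSTM(nn.Module):
    """Generates a caption token by token, conditioned on the image feature."""
    def __init__(self, embed_size: int, hidden_size: int, vocab_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features: torch.Tensor, captions: torch.Tensor) -> torch.Tensor:
        # Teacher forcing: prepend the image feature as the first "token",
        # then feed the ground-truth caption shifted by one position.
        embeddings = self.embed(captions[:, :-1])
        inputs = torch.cat([features.unsqueeze(1), embeddings], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.fc(hidden)  # logits over the vocabulary at each time step
```

In this arrangement the CNN summarizes the image once, and the LSTM unrolls that summary into a word sequence, which matches the encoder-decoder division of labor the abstract describes.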
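The abstract names cosine similarity, precision, and recall as the evaluation metrics. Below is a hedged sketch of how such sentence-level scores are commonly computed between a predicted caption and a reference caption; the bag-of-words vectorization with raw term counts is an assumption, since the abstract does not specify the representation.

```python
from collections import Counter
import math

def cosine_similarity(pred: str, ref: str) -> float:
    """Cosine similarity between bag-of-words count vectors of two captions."""
    a, b = Counter(pred.lower().split()), Counter(ref.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def precision_recall(pred: str, ref: str) -> tuple[float, float]:
    """Token-level precision and recall of the predicted caption."""
    p, r = set(pred.lower().split()), set(ref.lower().split())
    overlap = len(p & r)
    precision = overlap / len(p) if p else 0.0
    recall = overlap / len(r) if r else 0.0
    return precision, recall

# Example with hypothetical captions:
print(cosine_similarity("a dog runs on grass", "a dog is running on the grass"))
print(precision_recall("a dog runs on grass", "a dog is running on the grass"))
```

Precision here rewards predicted words that also appear in the reference, recall rewards coverage of the reference words, and cosine similarity weighs both in a single vector-space score.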
