Abstract

Automated image captioning is crucial for numerous real-world applications, including robotics, image indexing, and self-driving vehicles, and it is greatly helpful to people with impaired eyesight. Image captioning models built with machine learning algorithms can convert an image captured in real time into text. Understanding an image depends largely on its features, and machine learning techniques are widely used for image captioning tasks. This research study performs a comparative analysis of three Machine Learning (ML) approaches: k-Nearest Neighbor (KNN), a Convolutional Neural Network (CNN) with Long Short-Term Memory (LSTM), and an Attention-Based LSTM. In addition, an improved KNN algorithm with reduced time complexity, together with improved CNN-LSTM and Attention-Based LSTM models augmented with beam search decoding, is proposed to further strengthen the underlying approaches. The performance of the three selected models is empirically evaluated using BLEU, ROUGE, and METEOR scores on the widely used Flickr8k dataset, and the experimental results demonstrate the superiority of the Attention-Based LSTM over the other two approaches. Finally, the study's findings help guide researchers and practitioners in selecting an appropriate image-captioning approach, backed by empirical evidence in terms of standard evaluation metrics.
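To illustrate the beam search decoding mentioned above, the sketch below shows the general technique: keeping the top-k partial captions by cumulative log-probability at each decoding step. It is a minimal sketch, not the paper's implementation; the step function, special tokens, beam width, and toy bigram model are all illustrative assumptions standing in for an LSTM decoder's softmax output.

    import math

    def beam_search(step_fn, start_token, end_token, beam_width=3, max_len=20):
        # Generic beam search decoder. step_fn(sequence) must return a dict
        # mapping candidate next tokens to their log-probabilities given the
        # sequence so far. All names and widths here are illustrative, not
        # taken from the paper.
        beams = [([start_token], 0.0)]  # (partial caption, cumulative log-prob)

        for _ in range(max_len):
            candidates = []
            all_finished = True
            for seq, score in beams:
                if seq[-1] == end_token:
                    candidates.append((seq, score))  # finished; carry over as-is
                    continue
                all_finished = False
                for token, logp in step_fn(seq).items():
                    candidates.append((seq + [token], score + logp))
            # Keep only the top beam_width hypotheses by total log-probability.
            candidates.sort(key=lambda c: c[1], reverse=True)
            beams = candidates[:beam_width]
            if all_finished:
                break

        return max(beams, key=lambda c: c[1])[0]

    # Toy bigram "model" standing in for a trained decoder, purely for
    # demonstration.
    toy_lm = {
        "<s>": {"a": math.log(0.6), "the": math.log(0.4)},
        "a":   {"dog": math.log(0.7), "cat": math.log(0.3)},
        "the": {"dog": math.log(0.5), "cat": math.log(0.5)},
        "dog": {"</s>": 0.0},
        "cat": {"</s>": 0.0},
    }

    print(beam_search(lambda seq: toy_lm[seq[-1]], "<s>", "</s>", beam_width=2))
    # -> ['<s>', 'a', 'dog', '</s>']

Because the beam retains several hypotheses instead of greedily committing to the single most likely next word, it can recover captions whose early words are individually less probable but whose overall sequence scores higher.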
