Abstract

The Image Caption Generator is an intriguing project that bridges the domains of computer vision and natural language processing, aiming to automatically generate descriptive text for images. Considerable progress has been made over the last decade, yet key challenges persist in optimising Convolutional Neural Networks (CNNs) for precise image feature extraction and in refining Long Short-Term Memory (LSTM) networks for coherent text generation. These challenges include overfitting, limited context understanding, and difficulty modelling long-range dependencies. Solutions have emerged in the form of attention mechanisms, which focus the decoder on relevant image regions, and alternative recurrent architectures such as Gated Recurrent Units (GRUs), which improve sequential modelling. These innovations have significantly enhanced overall performance, producing more accurate and contextually relevant image captions and advancing the synergy between computer vision and natural language processing in the realm of image captioning.
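
To make the encoder–decoder pipeline described above concrete, the sketch below shows one common way to pair a CNN feature extractor with an LSTM caption decoder in PyTorch. The backbone choice (ResNet-50), embedding and hidden sizes, and vocabulary size are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal CNN-encoder / LSTM-decoder captioning sketch (assumed configuration).
import torch
import torch.nn as nn
import torchvision.models as models

class EncoderCNN(nn.Module):
    def __init__(self, embed_size=256):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        # Drop the classification head; keep the pooled image feature vector.
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        self.fc = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, images):
        with torch.no_grad():                     # frozen feature extractor
            features = self.backbone(images)      # (B, 2048, 1, 1)
        return self.fc(features.flatten(1))       # (B, embed_size)

class DecoderLSTM(nn.Module):
    def __init__(self, embed_size=256, hidden_size=512, vocab_size=5000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # Prepend the image feature as the first step of the input sequence,
        # then predict word logits at every time step.
        embeddings = self.embed(captions)                       # (B, T, E)
        inputs = torch.cat([features.unsqueeze(1), embeddings], dim=1)
        hidden, _ = self.lstm(inputs)                           # (B, T+1, H)
        return self.fc(hidden)                                  # (B, T+1, vocab)
```

An attention-based variant would replace the single pooled feature with a grid of spatial features and compute a weighted sum of them at each decoding step, which is the mechanism the abstract credits with focusing generation on relevant image regions.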
