Abstract
Image captioning is a fast-evolving research area that integrates two domains: Natural Language Processing and Computer Vision. Generating appropriate captions is difficult because of the many objects and activities an image may depict. To address these challenges, the proposed work employs a ResNet50 encoder for image feature extraction and a Hybrid LSTM–GRU decoder, optimized with Beam Search, to produce textual descriptions. Beam Search is a decoding technique that yields higher-quality, more consistent captions by exploring multiple paths through the search space and selecting the most likely candidate according to a cumulative score or probability. The findings compare CNN encoders such as VGG16, InceptionV3, ResNet50, and DenseNet121 paired with an LSTM language model in terms of loss and accuracy on the Flickr8k dataset. To further improve caption quality, the proposed method combines ResNet50 with a Hybrid LSTM–GRU decoder and Beam Search, achieving an accuracy of 0.8932 and a loss of 0.4013 on the Flickr8k dataset. The proposed method, ResNet50 + Hybrid LSTM–GRU with Beam Search, also outperforms the aforementioned encoder–decoder models with Greedy Search, achieving a BLEU score of 0.6034.
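To make the decoding strategy concrete, the following is a minimal, generic Python sketch of Beam Search. The `step_fn` callback, token names, and the toy bigram table are hypothetical stand-ins for one step of the actual Hybrid LSTM–GRU decoder, not the paper's implementation.

```python
import heapq
import math

def beam_search(step_fn, start_token, end_token, beam_width=3, max_len=20):
    """Generic beam search over a token-level language model.

    step_fn(sequence) is assumed to return a list of
    (token, log_probability) pairs for the next token; it stands in
    for one decoder step of a captioning model.
    """
    # Each beam entry is (cumulative log-probability, token sequence).
    beams = [(0.0, [start_token])]
    for _ in range(max_len):
        candidates = []
        for log_prob, seq in beams:
            if seq[-1] == end_token:
                # Finished captions are carried over unchanged.
                candidates.append((log_prob, seq))
                continue
            for token, token_lp in step_fn(seq):
                candidates.append((log_prob + token_lp, seq + [token]))
        # Keep only the beam_width highest-scoring partial captions.
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
        if all(seq[-1] == end_token for _, seq in beams):
            break
    # Return the most probable completed sequence.
    return max(beams, key=lambda c: c[0])[1]

# Toy demo: a fixed bigram table standing in for the decoder network.
table = {
    "<s>": [("a", math.log(0.6)), ("the", math.log(0.4))],
    "a": [("dog", math.log(0.7)), ("</s>", math.log(0.3))],
    "the": [("dog", math.log(0.5)), ("</s>", math.log(0.5))],
    "dog": [("</s>", math.log(1.0))],
}
print(beam_search(lambda s: table[s[-1]], "<s>", "</s>"))
```

Unlike Greedy Search, which commits to the single most likely token at each step, this procedure keeps `beam_width` partial captions alive and ranks them by cumulative log-probability, which is the property that drives the BLEU improvement reported above.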