Replicating the innate human capability to understand image content and create descriptive text is a formidable challenge for machines. Image captioning, the automatic creation of concise, descriptive captions for images, is highly versatile and draws on techniques from Natural Language Processing (NLP), Computer Vision (CV), and Deep Learning (DL). These techniques enable the automated generation of accurate and meaningful captions, which have numerous practical applications across industries such as healthcare, entertainment, and marketing, to mention but a few. This study employed a Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) network, trained on the Hibs dataset. The model learned from input images paired with their associated captions and was then used to generate predicted captions. Model evaluation assessed the model's performance using metrics such as the BLEU score and the METEOR score, which measure the quality of the model's output against reference captions. These metrics quantify the accuracy and fluency of the generated text, and the results are presented to provide insights into the model's performance.
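A minimal sketch of the kind of CNN + LSTM captioning model the abstract describes, assuming Keras with pre-extracted image features; the layer sizes, vocabulary size, caption length, and merge design are illustrative assumptions, not details taken from the study:

```python
# Sketch of a merge-style CNN + LSTM captioning model (assumed, not the paper's exact setup).
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size = 5000    # assumed vocabulary size
max_len = 34         # assumed maximum caption length
feature_dim = 2048   # e.g. features from a pretrained CNN such as InceptionV3

# Image branch: pre-extracted CNN features projected into the decoder space.
img_input = Input(shape=(feature_dim,))
img_embed = Dense(256, activation="relu")(Dropout(0.5)(img_input))

# Text branch: the partial caption fed through an embedding layer and an LSTM.
txt_input = Input(shape=(max_len,))
txt_embed = Embedding(vocab_size, 256, mask_zero=True)(txt_input)
txt_lstm = LSTM(256)(Dropout(0.5)(txt_embed))

# Merge both branches and predict the next word of the caption.
merged = add([img_embed, txt_lstm])
output = Dense(vocab_size, activation="softmax")(Dense(256, activation="relu")(merged))

model = Model(inputs=[img_input, txt_input], outputs=output)
model.compile(loss="categorical_crossentropy", optimizer="adam")
model.summary()
```

At inference time, such a model is typically run word by word: the image features stay fixed while the caption generated so far is fed back into the text branch until an end-of-sequence token is produced.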
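Likewise, a hedged sketch of the evaluation step, assuming NLTK's BLEU and METEOR implementations; the reference and candidate captions below are placeholder examples, not outputs from the study:

```python
# Comparing a generated caption against a reference caption with BLEU and METEOR.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
# Note: NLTK's METEOR requires WordNet data, e.g. nltk.download("wordnet").

references = [["a", "dog", "runs", "across", "the", "grass"]]    # tokenized reference caption(s)
candidate = ["a", "dog", "is", "running", "on", "the", "grass"]  # tokenized generated caption

bleu = sentence_bleu(references, candidate,
                     smoothing_function=SmoothingFunction().method1)
meteor = meteor_score(references, candidate)

print(f"BLEU:   {bleu:.3f}")
print(f"METEOR: {meteor:.3f}")
```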