Abstract
In recent years, there has been growing interest among researchers in image captioning, the task of generating one or more descriptions of an image that closely resemble those a human would write. Most existing studies in this area focus on the English language, using CNN and RNN variants as encoder and decoder models, often enhanced with attention mechanisms. Although Bengali is the fifth most-spoken native language and the seventh most widely spoken language overall, it has received far less attention than resource-rich languages such as English. This study aims to bridge that gap by introducing a novel approach to image captioning in Bengali. By combining state-of-the-art convolutional neural networks, namely EfficientNetV2S, ConvNeXtSmall, and InceptionResNetV2, with a modified Transformer, the proposed system achieves computational efficiency while generating accurate, contextually relevant captions. In addition, Bengali text-to-speech synthesis is incorporated into the framework to help visually impaired Bengali speakers understand their surroundings and visual content more effectively. The model is evaluated on a chimeric dataset that pairs Bengali descriptions from the Ban-Cap dataset with the corresponding images from the Flickr 8k dataset. Using the EfficientNet encoder, the proposed model attains METEOR, CIDEr, and ROUGE scores of 0.34, 0.30, and 0.40, respectively, while its BLEU scores for unigram, bigram, trigram, and four-gram matching are 0.66, 0.59, 0.44, and 0.26. The results demonstrate that the proposed approach produces precise image descriptions and outperforms other state-of-the-art models at generating Bengali captions.
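The abstract describes a pipeline in which a pretrained CNN encodes the image and a Transformer decoder generates the Bengali caption. The sketch below is a minimal, illustrative Keras outline of that general pattern, not the authors' implementation; the vocabulary size, caption length, embedding width, input resolution, and layer counts are assumed placeholder values.

# Minimal, illustrative sketch (not the authors' code) of the described pipeline:
# a frozen EfficientNetV2S image encoder feeding a small Transformer-style decoder.
# VOCAB_SIZE, MAX_LEN, and EMBED_DIM are assumed placeholder values.
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE = 10_000   # assumed Bengali token vocabulary size
MAX_LEN = 30          # assumed maximum caption length (in tokens)
EMBED_DIM = 512       # assumed embedding / model width

# Image encoder: pretrained EfficientNetV2S used as a fixed feature extractor.
cnn = tf.keras.applications.EfficientNetV2S(
    include_top=False, weights="imagenet", input_shape=(384, 384, 3))
cnn.trainable = False

image_in = layers.Input(shape=(384, 384, 3), name="image")
fmap = cnn(image_in)                                   # (H', W', C) feature map
patches = layers.Reshape((-1, fmap.shape[-1]))(fmap)   # flatten the spatial grid
enc_out = layers.Dense(EMBED_DIM)(patches)             # project to decoder width

# Decoder: causal self-attention over caption tokens, cross-attention to image features.
tokens_in = layers.Input(shape=(MAX_LEN,), dtype="int32", name="tokens")
x = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(tokens_in)
x = layers.LayerNormalization()(
    x + layers.MultiHeadAttention(num_heads=8, key_dim=64)(x, x, use_causal_mask=True))
x = layers.LayerNormalization()(
    x + layers.MultiHeadAttention(num_heads=8, key_dim=64)(x, enc_out))
x = layers.Dense(EMBED_DIM, activation="relu")(x)
logits = layers.Dense(VOCAB_SIZE)(x)                   # next-token scores per position

model = tf.keras.Model([image_in, tokens_in], logits)
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

At inference time, captions would be generated autoregressively from the decoder's next-token logits (for example with greedy or beam search), and the resulting Bengali text could then be passed to a text-to-speech engine, as the framework described in the abstract does for visually impaired users.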