Abstract

Image captioning has attracted considerable attention from deep learning researchers. It combines image- and text-based deep learning techniques to generate written descriptions of images automatically. Research on image captioning in the Nepali language is limited, with most studies focusing on English datasets; consequently, no captioning datasets are publicly available in Nepali. Most previous work is based on the RNN-CNN approach, which produces inferior results compared to image captioning with the Transformer model. Likewise, using the BLEU score as the sole evaluation metric cannot fully assess the quality of the generated captions. To address this gap, in this research work the well-known "Flickr8k" English dataset is translated into the Nepali language and then manually corrected to ensure accurate translations. The conventional Transformer comprises encoder and decoder modules, both of which contain a multi-head attention mechanism; this makes the model complex and computationally expensive. Hence, we propose a novel approach in which the encoder module of the Transformer is removed entirely and only the decoder is used, in conjunction with a CNN that acts as a feature extractor. Image features are extracted with MobileNetV3 Large, while the Transformer decoder processes these feature vectors together with the input text sequence to generate appropriate captions. The system's effectiveness is measured with metrics that judge the quality and precision of the generated captions, namely the BLEU and METEOR scores.
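The encoder-free design described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes PyTorch, uses random tensors as a stand-in for MobileNetV3-Large spatial features (whose 960-channel width is an assumption from the standard architecture), and a toy vocabulary size. The key point is that the Transformer encoder stack is absent; the CNN feature vectors enter the decoder directly through cross-attention.

```python
import torch
import torch.nn as nn

class DecoderOnlyCaptioner(nn.Module):
    """Encoder-free captioner sketch: projected CNN features serve as the
    cross-attention 'memory' of a Transformer decoder, so no Transformer
    encoder stack is needed."""
    def __init__(self, vocab_size, d_model=256, nhead=8, num_layers=3,
                 feat_dim=960, max_len=512):
        super().__init__()
        # Project CNN feature vectors (960-d assumed for MobileNetV3-Large)
        # into the decoder's model width.
        self.feat_proj = nn.Linear(feat_dim, d_model)
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, feats, tokens):
        # feats:  (B, P, feat_dim) spatial feature vectors from the CNN
        # tokens: (B, T) partial caption token ids
        T = tokens.shape[1]
        memory = self.feat_proj(feats)                 # cross-attention memory
        pos = torch.arange(T, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        mask = nn.Transformer.generate_square_subsequent_mask(T)
        h = self.decoder(x, memory, tgt_mask=mask)     # causal self-attn + cross-attn
        return self.out(h)                             # (B, T, vocab_size) logits

model = DecoderOnlyCaptioner(vocab_size=1000)
feats = torch.randn(2, 49, 960)            # stand-in for a 7x7 CNN feature map
tokens = torch.randint(0, 1000, (2, 12))   # toy caption prefixes
logits = model(feats, tokens)
```

At inference time the same module would be called autoregressively, appending the highest-scoring (or beam-searched) token at each step until an end-of-sequence token is produced.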
