Abstract

Image captioning has attracted considerable attention from deep learning researchers. It combines image- and text-based deep learning techniques to generate written descriptions of images automatically. Research on image captioning in the Nepali language is limited, with most studies focusing on English datasets; consequently, no captioning datasets are publicly available in Nepali. Most previous work is based on the RNN-CNN approach, which produces inferior results compared to image captioning with the Transformer model. Likewise, using the BLEU score as the sole evaluation metric cannot fully assess the quality of the generated captions. To address this gap, in this research work the well-known "Flickr8k" English dataset is translated into the Nepali language and then manually corrected to ensure accurate translations. The conventional Transformer comprises encoder and decoder modules, both of which contain a multi-head attention mechanism; this makes the model complex and computationally expensive. Hence, we propose a novel approach in which the encoder module of the Transformer is removed entirely and only the decoder is used, in conjunction with a CNN that acts as a feature extractor. Image features are extracted with MobileNetV3 Large, while the Transformer decoder processes these feature vectors together with the input text sequence to generate appropriate captions. The system's effectiveness is measured with metrics that judge the quality and precision of the generated captions, namely the BLEU and METEOR scores.
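The encoder-free design described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes PyTorch, uses random tensors as a stand-in for MobileNetV3-Large spatial features (whose 960-channel width is an assumption from the standard architecture), and a toy vocabulary size. The key point is that the Transformer encoder stack is absent; the CNN feature vectors enter the decoder directly through cross-attention.

```python
import torch
import torch.nn as nn

class DecoderOnlyCaptioner(nn.Module):
    """Encoder-free captioner sketch: projected CNN features serve as the
    cross-attention 'memory' of a Transformer decoder, so no Transformer
    encoder stack is needed."""
    def __init__(self, vocab_size, d_model=256, nhead=8, num_layers=3,
                 feat_dim=960, max_len=512):
        super().__init__()
        # Project CNN feature vectors (960-d assumed for MobileNetV3-Large)
        # into the decoder's model width.
        self.feat_proj = nn.Linear(feat_dim, d_model)
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, feats, tokens):
        # feats:  (B, P, feat_dim) spatial feature vectors from the CNN
        # tokens: (B, T) partial caption token ids
        T = tokens.shape[1]
        memory = self.feat_proj(feats)                 # cross-attention memory
        pos = torch.arange(T, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        mask = nn.Transformer.generate_square_subsequent_mask(T)
        h = self.decoder(x, memory, tgt_mask=mask)     # causal self-attn + cross-attn
        return self.out(h)                             # (B, T, vocab_size) logits

model = DecoderOnlyCaptioner(vocab_size=1000)
feats = torch.randn(2, 49, 960)            # stand-in for a 7x7 CNN feature map
tokens = torch.randint(0, 1000, (2, 12))   # toy caption prefixes
logits = model(feats, tokens)
```

At inference time the same module would be called autoregressively, appending the highest-scoring (or beam-searched) token at each step until an end-of-sequence token is produced.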
