Automatic captioning of images helps identify features of multimedia content and supports the detection of interesting patterns, trends, and occurrences. English image captioning has recently made remarkable progress; Arabic image captioning, however, still lags behind, and Arabic image-caption generation remains a difficult machine learning problem. This paper presents a more accurate model for Arabic image captioning that uses transformer models in both the encoder and the decoder: transformer-based vision models serve as image feature extractors in the encoder, and a pre-trained word embedding model is used in the decoder. All models are implemented, trained, and tested on the Arabic Flickr8k dataset. For the image feature extraction subsystem, we compared three individual vision models (Swin, XCiT, and ConvNeXt) as well as their concatenations to obtain the most expressive image feature vector. For the caption generation (lingual) subsystem, we tested four pre-trained language embedding models (AraBERT, AraELECTRA, MARBERTv2, and CamelBERT) to select the most accurate one. Our experiments showed that an Arabic image captioning system using the concatenation of the three transformer-based vision models ConvNeXt, Swin, and XCiT as the image feature extractor, combined with the CamelBERT language embedding model, produces the best results among the tested combinations, with a BLEU-1 score of 0.5980, while the combination of ConvNeXt and Swin with the AraELECTRA language embedding model achieves a BLEU-4 score of 0.1664; both exceed the previously reported values of 0.443 and 0.157.
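Since the encoder fuses features from three vision backbones by concatenation, the sketch below illustrates that fusion step. It is a minimal illustration only, assuming the timm library and its publicly available Swin, XCiT, and ConvNeXt checkpoints; the specific checkpoint names, input size, and pooling behavior are assumptions for demonstration and are not taken from the paper.

```python
import timm
import torch

# Illustrative sketch: concatenate pooled features from three pretrained
# vision backbones (Swin, XCiT, ConvNeXt) into one image feature vector.
# The checkpoint names below are assumptions, not the paper's exact models.
backbones = [
    timm.create_model("swin_base_patch4_window7_224", pretrained=True, num_classes=0),
    timm.create_model("xcit_small_24_p8_224", pretrained=True, num_classes=0),
    timm.create_model("convnext_base", pretrained=True, num_classes=0),
]
for model in backbones:
    model.eval()

# Placeholder for a preprocessed Flickr8k image (batch of 1, 3x224x224).
image = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    # num_classes=0 strips the classification head, so each backbone
    # returns its pooled global feature vector.
    features = [model(image) for model in backbones]

# Concatenated feature vector that a caption decoder could attend to.
fused = torch.cat(features, dim=-1)
print(fused.shape)
```

In this setup, dropping or adding a backbone only changes the width of the fused vector, which is how the different encoder combinations reported in the abstract (e.g., ConvNeXt + Swin versus all three models) could be compared.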