Abstract

Image captioning is a major artificial intelligence research field that involves the visual interpretation and linguistic description of a given image. Successful image captioning relies on acquiring as much information as feasible from the original image. One essential piece of such information is the topic, or concept, with which the image is associated. Recently, the concept modeling technique has been utilized in English image captioning to fully capture image contexts and use them to produce more accurate image descriptions. In this paper, a concept-based model is proposed for Arabic Image Captioning (AIC). A novel Vision-based Multi-Encoder Transformer Architecture (ViMETA) is proposed to handle the multiple outputs resulting from the concept modeling technique while producing the image caption. The BiLingual Evaluation Understudy (BLEU) and Recall-Oriented Understudy for Gisting Evaluation (ROUGE) standard metrics have been used to evaluate the proposed model on the Flickr8K dataset with Arabic captions. Furthermore, a qualitative analysis has been conducted to compare the captions produced by the proposed model with the ground-truth descriptions. Based on the experimental results, the proposed model outperformed the related works both quantitatively, using the BLEU and ROUGE metrics, and qualitatively.
