Abstract

Automatic captioning of images not only enriches multimedia content with descriptive features, but also helps in detecting patterns, trends, and events of interest. Arabic image caption generation in particular is a challenging topic in the machine learning field. This paper presents AraCap, a hybrid object-based, attention-enriched image captioning architecture with a focus on the Arabic language. Three models are demonstrated; all are implemented and trained on the COCO and Flickr30k datasets, and then tested on an Arabic version of a subset of the COCO dataset built for this work. The first model is an object-based captioner that can handle one or multiple detected objects. The second is a combined pipeline that uses both an object detector and attention-based captioning, while the third is based on a pure soft attention mechanism. The models are evaluated using multilingual semantic sentence similarity techniques to assess the accuracy of the generated captions against the ground-truth captions. Results show that the similarity scores of the Arabic captions generated by all three proposed models outperform those of the basic captioning technique.
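For concreteness, the following minimal sketch illustrates how a multilingual semantic sentence similarity evaluation of this kind could be carried out, using a sentence-embedding model and cosine similarity. The model name, the example captions, and the best-match scoring against references are illustrative assumptions, not the paper's exact setup.

from sentence_transformers import SentenceTransformer, util

# Multilingual sentence-embedding model that covers Arabic (illustrative choice).
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

generated = "رجل يركب دراجة في الشارع"  # hypothetical generated Arabic caption
references = [  # hypothetical ground-truth captions
    "رجل يقود دراجة هوائية في الطريق",
    "شخص على دراجة في شارع المدينة",
]

# Embed the generated caption and the references, then take the best
# cosine-similarity match as the caption's score.
gen_emb = model.encode(generated, convert_to_tensor=True)
ref_embs = model.encode(references, convert_to_tensor=True)
score = util.cos_sim(gen_emb, ref_embs).max().item()
print(f"semantic similarity: {score:.3f}")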
