Abstract

Image captioning has been widely studied because of its role in automatic visual scene understanding, which is useful for remote monitoring systems and for visually impaired people. Attention-based models, including the transformer, are the current state-of-the-art architectures for building image captioning models. This study examines work on the development of image captioning models, especially models built on the attention mechanism. The architectures, datasets, and evaluation metrics of the collected works are analyzed, and a general workflow for image captioning model development is presented. The literature search was carried out on Google Scholar and yielded 36 publications, including work on image captioning in Indonesian, which provides one perspective on image captioning development in a low-resource language. Studies using transformer models generally achieve higher evaluation scores; in our findings, the highest scores on the consensus-based image description evaluation (CIDEr) c5 and c40 metrics are 138.5 and 140.5, respectively. This study provides a baseline for future development of image captioning models and outlines the general image captioning development process, including a picture of development in a low-resource language.
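For context on the attention-based architectures surveyed here, the sketch below is a minimal NumPy illustration of scaled dot-product attention, the core operation of the transformer, in which decoder queries attend over encoded image-region features. It is not an implementation from any of the surveyed works; the array shapes and variable names are illustrative assumptions only.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # similarity between queries and keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V  # weighted sum of value vectors

# Toy example: 2 decoder queries attending over 3 image-region features.
rng = np.random.default_rng(0)
d_model = 4
queries = rng.normal(size=(2, d_model))  # e.g. decoder hidden states
keys = rng.normal(size=(3, d_model))     # e.g. encoded image regions
values = rng.normal(size=(3, d_model))
context = scaled_dot_product_attention(queries, keys, values)
print(context.shape)  # (2, 4)
```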
