Abstract

Transformers have been widely used for image captioning on English-language datasets such as MSCOCO and Flickr. Research on image captioning in Indonesian, however, remains scarce and typically relies on machine translation to obtain an Indonesian dataset. In this study, a Transformer model is used to generate captions from a modified MSCOCO dataset, with the goal of gaining visual understanding of indoor environments. We modified the MSCOCO dataset by writing new Indonesian text descriptions for the MSCOCO images, following a few simple rules that include each object's location, colour, and characteristics. Experiments were carried out with several pre-trained CNN models to extract image features before feeding them to the Transformer. We also tuned the models' hyper-parameters, assigning different values for batch size, dropout rate, and number of attention heads to find the best model. BLEU-n, METEOR, CIDEr, and ROUGE-L are used to evaluate the models. In this study, EfficientNetB0 with a batch size of 128, a dropout rate of 0.2, and 4 attention heads achieved the best scores on all four evaluation metrics: BLEU-4 of 0.344, ROUGE-L of 0.535, METEOR of 0.264, and CIDEr of 0.492.
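To make the described pipeline concrete, the sketch below (not the authors' published code) shows how a frozen, pre-trained EfficientNetB0 can supply image features to a single Transformer encoder block configured with the best-performing hyper-parameters reported above (4 attention heads, dropout of 0.2). The embedding size, input resolution, number of encoder blocks, and the choice of TensorFlow/Keras are assumptions for illustration only.

```python
# A minimal sketch of the CNN-feature -> Transformer pipeline the
# abstract describes. Only the 4 attention heads and 0.2 dropout come
# from the paper; all other sizes are illustrative assumptions.
import tensorflow as tf

EMBED_DIM = 512   # assumed feature projection size
NUM_HEADS = 4     # best-performing setting reported in the abstract
DROPOUT = 0.2     # best-performing dropout reported in the abstract

# Pre-trained EfficientNetB0 as a frozen image feature extractor.
cnn = tf.keras.applications.EfficientNetB0(
    include_top=False, weights="imagenet"
)
cnn.trainable = False

def build_encoder() -> tf.keras.Model:
    # Image -> 7x7 grid of CNN features -> sequence of 49 tokens.
    images = tf.keras.Input(shape=(224, 224, 3))
    features = cnn(images)                       # (7, 7, 1280)
    seq = tf.keras.layers.Reshape((-1, features.shape[-1]))(features)
    seq = tf.keras.layers.Dense(EMBED_DIM, activation="relu")(seq)

    # One Transformer encoder block over the feature sequence.
    attn = tf.keras.layers.MultiHeadAttention(
        num_heads=NUM_HEADS,
        key_dim=EMBED_DIM // NUM_HEADS,
        dropout=DROPOUT,
    )(seq, seq)
    x = tf.keras.layers.LayerNormalization()(seq + attn)
    ffn = tf.keras.layers.Dense(EMBED_DIM)(
        tf.keras.layers.Dense(EMBED_DIM * 4, activation="relu")(x)
    )
    out = tf.keras.layers.LayerNormalization()(x + ffn)
    return tf.keras.Model(images, out)
```

In a full captioning model, the encoder output would be attended to by a Transformer decoder that generates the Indonesian caption token by token; the abstract's best configuration would additionally train with a batch size of 128 (e.g., passed as `batch_size=128` to `model.fit`).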
