Abstract

Automatic Image Captioning (AIC) refers to the process of synthesizing semantically and syntactically correct descriptions for images. Existing research on AIC has predominantly focused on the English language; comparatively few works have addressed captioning systems for low-resource Indian languages such as Assamese. This paper investigates AIC for the Assamese language using two distinct approaches. The first approach uses a state-of-the-art AIC model pretrained on an English image-caption dataset to generate English captions for input images; these English captions are then translated into Assamese using a publicly available automatic translator. The second approach trains the AIC model exclusively on an Assamese image-caption dataset to predict captions directly in Assamese. The experiments are performed on two state-of-the-art models: one that uses an LSTM as the decoder and another that uses a Transformer. Through extensive experimentation, the performance of these approaches is evaluated both quantitatively and qualitatively. The quantitative results are obtained using automatic metrics such as BLEU-n and CIDEr, while the qualitative analysis relies on human evaluation. The comparison of the two approaches reveals that models trained exclusively on Assamese image-caption datasets achieve superior results, both in quantitative measures and in qualitative assessment, compared to models pretrained on English whose captions are subsequently translated into Assamese.
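
To make the first approach concrete, the following is a minimal sketch of a caption-then-translate pipeline, assuming publicly available Hugging Face checkpoints (nlpconnect/vit-gpt2-image-captioning for English captioning and facebook/nllb-200-distilled-600M for translation, which covers Assamese). These checkpoints are illustrative stand-ins, not the models evaluated in the paper.

```python
# Sketch of the first approach: caption in English, then machine-translate
# into Assamese. The checkpoints below are illustrative assumptions, not
# the models used in the paper.
from transformers import pipeline

# Pretrained English image captioner (assumed checkpoint).
captioner = pipeline("image-to-text",
                     model="nlpconnect/vit-gpt2-image-captioning")

# English -> Assamese translator; NLLB-200 includes Assamese (asm_Beng).
translator = pipeline("translation",
                      model="facebook/nllb-200-distilled-600M",
                      src_lang="eng_Latn",
                      tgt_lang="asm_Beng")

def caption_in_assamese(image_path: str) -> str:
    """Generate an English caption for the image and translate it."""
    english = captioner(image_path)[0]["generated_text"]
    return translator(english)[0]["translation_text"]

print(caption_in_assamese("example.jpg"))
```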
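
For the quantitative evaluation, BLEU-n measures n-gram overlap between a predicted caption and one or more reference captions. A minimal scoring sketch using NLTK is shown below; the placeholder tokens stand in for tokenized Assamese captions, and the paper's exact tokenization and smoothing settings are not specified here.

```python
# Minimal BLEU-n scoring sketch with NLTK (illustrative placeholders;
# in practice the references and candidate would be tokenized Assamese
# captions, and the paper's exact settings may differ).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [["a", "man", "walks", "down", "a", "road"]]     # tokenized reference(s)
candidate = ["a", "man", "is", "walking", "on", "a", "road"]  # model output

smooth = SmoothingFunction().method1  # avoids zero scores for short captions
for n in range(1, 5):
    # Uniform weights over 1..n-grams give BLEU-1 through BLEU-4.
    weights = tuple([1.0 / n] * n)
    score = sentence_bleu(references, candidate,
                          weights=weights, smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.4f}")
```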
