Semantic-Enhanced Cross-Modal Fusion for Improved Unsupervised Image Captioning

Nan Xiang, Xingdi Rao, Leiyan Liang, Ling Chen, Zehao Gong

doi:10.3390/electronics12173549

Open Access

https://doi.org/10.3390/electronics12173549

Abstract

Unsupervised image captioning often suffers from image–text mismatches and modality gaps, which result in suboptimal captions. This paper introduces a semantic-enhanced cross-modal fusion model (SCFM) to address these issues. The SCFM integrates three components: a text semantic enhancement network (TSE-Net) for nuanced semantic representation; contrastive learning for optimizing similarity measures between text and images; and enhanced visual selection decoding (EVSD) for precise captioning. Unlike existing methods, which struggle to capture accurate semantic relationships and to adapt flexibly across scenarios, the proposed model provides a robust solution for unbiased and diverse captioning. In experimental evaluations on the MS COCO and Flickr30k datasets, SCFM demonstrates significant improvements over the benchmark model, improving the CIDEr and BLEU-4 metrics by 3.6% and 3.2%, respectively. Visualization analysis further shows that the model increases variability between hidden features and has potential for cross-domain and stylized image captioning. These findings contribute to the advancement of image captioning techniques and open avenues for future research. Further investigations will explore SCFM’s adaptability to other multimodal tasks and refine it for more intricate image–text relationships.
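
To make the contrastive component concrete, the following is a minimal sketch of a symmetric InfoNCE-style objective over paired image and text embeddings. The abstract does not specify the loss used in SCFM, so the formulation, the function name `symmetric_contrastive_loss`, and the `temperature` value are all assumptions made for illustration, not the authors' implementation.

```python
import math


def _normalize(vec):
    # Scale a vector to unit length so dot products become cosine similarities.
    # Assumes the vector is non-zero.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _cross_entropy_on_diagonal(logits):
    # Mean cross-entropy where the correct class for row i is column i,
    # computed with a numerically stable log-sum-exp.
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)
        log_sum = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_sum - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    # Hypothetical helper: image_embs[i] and text_embs[i] are assumed to be
    # a matched pair; every other combination in the batch is a negative.
    imgs = [_normalize(v) for v in image_embs]
    txts = [_normalize(v) for v in text_embs]
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    # Transposing swaps the roles: text-to-image instead of image-to-text.
    transposed = [list(col) for col in zip(*logits)]
    image_to_text = _cross_entropy_on_diagonal(logits)
    text_to_image = _cross_entropy_on_diagonal(transposed)
    return 0.5 * (image_to_text + text_to_image)


# Toy usage: correctly paired embeddings versus the same embeddings mispaired.
embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned_loss = symmetric_contrastive_loss(embs, embs)
mispaired_loss = symmetric_contrastive_loss(embs, embs[1:] + embs[:1])
```

In this toy case the loss is lower when each image embedding is paired with its own text embedding than when the pairs are rotated by one position. A training objective of this kind is meant to produce that behavior: matched image–text pairs are pulled together and mismatched pairs are pushed apart.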

Full Text