Abstract

Image captioning aims to automatically generate a natural language description of an image, an important yet challenging task that spans computer vision and natural language processing. The task has been dominated by Long Short-Term Memory (LSTM) based solutions. Although much progress has been made with LSTM-based models in recent years, they generate descriptions serially, which prevents parallel processing and pays little attention to the hierarchical structure of captions. To address this problem, we propose a framework that uses a CNN-based generation model to produce image captions with the help of conditional generative adversarial training (CGAN). Furthermore, a multi-modal graph convolutional network (MGCN) is used to exploit the visual relationships between objects so that the generated captions carry richer semantic meaning; the scene graph serves as a bridge that connects objects, attributes, and visual relationship information to produce better captions. Extensive experiments on the MSCOCO dataset show that our method achieves better or comparable scores relative to state-of-the-art methods. Ablation results show that CGAN and MGCN better capture the visual relationships between objects in an image and thus generate captions with richer semantic content.
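
To make the MGCN idea concrete, the following is a minimal sketch (not the authors' code) of one graph-convolution step over scene-graph node features, assuming object, attribute, and relationship nodes have already been encoded as fixed-size vectors and linked by an adjacency matrix; all names and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn


class SceneGraphConvLayer(nn.Module):
    """One GCN layer: each node aggregates features from its scene-graph neighbours."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, node_feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # node_feats: (num_nodes, in_dim) -- object/attribute/relationship embeddings
        # adj:        (num_nodes, num_nodes) -- scene-graph adjacency with self-loops
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)  # simple degree normalisation
        aggregated = adj @ node_feats / deg                 # average neighbour features
        return torch.relu(self.linear(aggregated))


# Toy usage: 5 scene-graph nodes with 512-d features.
nodes = torch.randn(5, 512)
adj = torch.eye(5)             # self-loops
adj[0, 1] = adj[1, 0] = 1.0    # e.g. an object--relationship edge
layer = SceneGraphConvLayer(512, 512)
refined = layer(nodes, adj)    # refined node features that a caption generator could consume
print(refined.shape)           # torch.Size([5, 512])
```

The refined node features would then be fed to the CNN-based caption generator in place of raw region features; the exact fusion used in the paper may differ.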
