Abstract

Image captioning is a multimodal task: given an input image, the goal is to automatically generate coherent sentences that describe the image's content from its visual information. However, in most models that fuse image information with semantic information, the extracted visual information is insufficient. In this paper, a parallel double-layer LSTM (d-LSTM) is proposed as a decoder for processing semantic information. The semantic information obtained from the hidden state of the first layer serves as the primary input to the semantic information generated by the second layer. Finally, the semantic information from the two decoder layers is fused to generate a finer-grained image caption. The superiority of the proposed model is verified by large-scale experiments on the MSCOCO dataset.
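The abstract does not specify the exact wiring of the d-LSTM decoder, but the described flow (layer 1 consumes the word and image features; layer 2 consumes layer 1's hidden state; the two layers' outputs are fused before word prediction) can be sketched roughly as follows. This is a minimal, untrained numpy sketch under stated assumptions: concatenation is used as the fusion operation, the image feature is fed to layer 1 at every step, and all class, function, and parameter names are hypothetical.

```python
import numpy as np

def lstm_step(x, h, c, W, U, b):
    """One LSTM cell step; gates are stacked as [input, forget, output, candidate]."""
    z = W @ x + U @ h + b
    H = h.shape[0]
    i = 1.0 / (1.0 + np.exp(-z[:H]))        # input gate
    f = 1.0 / (1.0 + np.exp(-z[H:2*H]))     # forget gate
    o = 1.0 / (1.0 + np.exp(-z[2*H:3*H]))   # output gate
    g = np.tanh(z[3*H:])                    # candidate cell state
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new

class DoubleLSTMDecoder:
    """Sketch of a parallel double-layer LSTM (d-LSTM) decoder (hypothetical names)."""

    def __init__(self, embed_dim, hidden_dim, vocab_size, seed=0):
        rng = np.random.default_rng(seed)
        init = lambda *shape: 0.1 * rng.standard_normal(shape)
        # Layer 1 consumes [word embedding ; image feature] (assumed same dim).
        self.W1 = init(4 * hidden_dim, 2 * embed_dim)
        self.U1 = init(4 * hidden_dim, hidden_dim)
        self.b1 = np.zeros(4 * hidden_dim)
        # Layer 2 consumes layer 1's hidden state as its primary information.
        self.W2 = init(4 * hidden_dim, hidden_dim)
        self.U2 = init(4 * hidden_dim, hidden_dim)
        self.b2 = np.zeros(4 * hidden_dim)
        # Fusion of both layers' hidden states (concatenation) -> vocabulary logits.
        self.Wout = init(vocab_size, 2 * hidden_dim)
        self.hidden_dim = hidden_dim

    def step(self, word_embed, img_feat, state):
        h1, c1, h2, c2 = state
        x1 = np.concatenate([word_embed, img_feat])
        h1, c1 = lstm_step(x1, h1, c1, self.W1, self.U1, self.b1)
        h2, c2 = lstm_step(h1, h2, c2, self.W2, self.U2, self.b2)
        fused = np.concatenate([h1, h2])    # fuse the two layers' semantics
        logits = self.Wout @ fused
        return logits, (h1, c1, h2, c2)

# Demo: one decoding step with random (untrained) weights, shapes only.
dec = DoubleLSTMDecoder(embed_dim=8, hidden_dim=16, vocab_size=50)
state = tuple(np.zeros(16) for _ in range(4))
logits, state = dec.step(np.zeros(8), np.ones(8), state)
```

In an actual captioning model the image feature would come from a CNN encoder and the logits would pass through a softmax during training; those parts are omitted here.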
