Abstract

Traditional image captioning models mainly rely on an encoder-decoder architecture to generate a single natural sentence for a given image. Such an architecture mostly uses deep neural networks to extract the neural representations of the image while ignoring the abstractive concepts, as well as their intertwined relationships, conveyed in the image. To comprehensively characterize the image content and bridge the gap between neural representations and high-level abstractive concepts, we make the first attempt to investigate the ability of neural symbolic representation of the image for the image captioning task. We first parse and convert a given image into a neural symbolic representation in the form of an attributed relational graph, with the nodes denoting the abstractive concepts and the branches indicating the relationships between connected nodes. By performing computations over the attributed relational graph, the neural symbolic representation evolves step by step, with the node and branch representations, together with their corresponding importance weights, updated at each step. Extensive experiments validate the effectiveness of the proposed method. It enables a more comprehensive understanding of the given image by integrating the neural representation and the neural symbolic representation, achieving state-of-the-art results on both the MSCOCO and Flickr30k datasets. Moreover, the proposed neural symbolic representation is shown to generalize better to other domains, with significant performance improvements over existing methods on the cross-domain image captioning task.
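To make the evolution of the attributed relational graph concrete, the sketch below shows one possible update step in PyTorch. It is an illustrative assumption, not the authors' implementation: the layer choices, feature dimensions, the `GraphUpdateStep` module, and the mean-aggregation scheme are all hypothetical, and only the general idea (edge and node features refreshed by message passing, with importance weights renormalized at each step) reflects the description above.

```python
# A minimal sketch (not the authors' implementation) of one update step over an
# attributed relational graph: node/edge features and their importance weights
# are refreshed by message passing. All dimensions and layers are illustrative.
import torch
import torch.nn as nn


class GraphUpdateStep(nn.Module):
    """One hypothetical step of evolving the neural symbolic representation."""

    def __init__(self, dim: int = 128):
        super().__init__()
        self.edge_update = nn.Linear(3 * dim, dim)   # [src, edge, dst] -> edge
        self.node_update = nn.Linear(2 * dim, dim)   # [node, message] -> node
        self.node_score = nn.Linear(dim, 1)          # importance weight per node
        self.edge_score = nn.Linear(dim, 1)          # importance weight per edge

    def forward(self, nodes, edges, edge_index):
        # nodes: (N, dim) concept features; edges: (E, dim) relation features
        # edge_index: (2, E); row 0 = source node ids, row 1 = target node ids
        src, dst = edge_index
        edges = torch.relu(self.edge_update(
            torch.cat([nodes[src], edges, nodes[dst]], dim=-1)))

        # Aggregate incoming edge messages per target node (mean aggregation).
        messages = torch.zeros_like(nodes).index_add_(0, dst, edges)
        counts = torch.zeros(nodes.size(0), 1).index_add_(
            0, dst, torch.ones(edges.size(0), 1)).clamp(min=1)
        nodes = torch.relu(self.node_update(
            torch.cat([nodes, messages / counts], dim=-1)))

        # Importance weights over nodes and edges, renormalized each step.
        node_w = torch.softmax(self.node_score(nodes).squeeze(-1), dim=0)
        edge_w = torch.softmax(self.edge_score(edges).squeeze(-1), dim=0)
        return nodes, edges, node_w, edge_w


# Example: a tiny graph with 3 concept nodes and 2 relations, evolved for 3 steps.
step = GraphUpdateStep(dim=128)
nodes = torch.randn(3, 128)
edges = torch.randn(2, 128)
edge_index = torch.tensor([[0, 1], [1, 2]])  # relations 0->1 and 1->2
for _ in range(3):
    nodes, edges, node_w, edge_w = step(nodes, edges, edge_index)
```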
