Abstract

Previous captioning methods rely only on semantic-level information, considering the similarity of features between image regions in visual space while ignoring the linguistic context incorporated in the decoder during caption generation. In this paper, we propose a transformer-based co-attention network that uses linguistic information to capture pairwise visual relationships among objects and salient visual features. During caption generation, we infer entity words from the visual content of objects, and we infer interactive words by focusing on the relationships between entity words, based on the relational context among the words produced in the course of caption decoding. Linguistic contextual information thus serves as a guide for discovering relationships between objects efficiently. Furthermore, we capture both intra-modal and inter-modal interactions with a multilevel co-attention network. Our model attains 44.1/33.6 BLEU@4, 30.8/25.1 METEOR, 61.9/55.1 ROUGE, 132.1/69.8 CIDEr, and 24.1/17.8 SPICE scores on the MSCOCO and Flickr30k datasets, respectively.
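To make the co-attention idea concrete, the following is a minimal illustrative sketch, not the authors' implementation: one block in which the decoder's linguistic context queries visual region features (inter-modal attention) after the region features are refined against each other (intra-modal attention). It assumes PyTorch; the class name, dimensions, and tensor shapes are hypothetical choices for illustration only.

```python
# Hypothetical sketch of a co-attention block: linguistic context attends over
# visual region features. Not the paper's actual architecture.
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    """Intra-modal self-attention over regions, then inter-modal attention
    where caption-decoder states (queries) attend to region features."""
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # Intra-modal: refine object/region features against each other
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Inter-modal: words (queries) -> image regions (keys/values)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, word_ctx: torch.Tensor, regions: torch.Tensor) -> torch.Tensor:
        # word_ctx: (B, T, d_model) linguistic context from the caption decoder
        # regions:  (B, N, d_model) visual features of detected objects
        regions = self.norm1(regions + self.self_attn(regions, regions, regions)[0])
        attended, _ = self.cross_attn(word_ctx, regions, regions)
        return self.norm2(word_ctx + attended)

# Usage with dummy tensors (batch of 2, 10 decoded words, 36 detected regions)
block = CoAttentionBlock()
words = torch.randn(2, 10, 512)   # partial caption context
feats = torch.randn(2, 36, 512)   # region features, e.g. from an object detector
out = block(words, feats)         # (2, 10, 512) linguistically guided visual context
```

Stacking several such blocks would give a multilevel variant in which the linguistic context guides which visual relationships are attended at each level, in the spirit of the approach described above.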
