Abstract
Image captioning requires not only accurate recognition of objects and their relationships, but also full comprehension of the scene. However, existing models suffer from partial understanding and object hallucination. In this paper, we propose a Cross-modal retrievAl and viSual condiTioning model (CAST) that addresses these issues with three key modules: an image–text retriever, an image & memory comprehender, and a dual attention decoder. To obtain a comprehensive understanding, we exploit cross-modal retrieval to mimic human cognition, i.e., to trigger retrieval of contextual information (called episodic memory) about a specific event. Specifically, the image–text retriever retrieves the top-n most relevant sentences, which serve as episodic memory for each input image. The image & memory comprehender then encodes the input image with self-attention and enriches the episodic memory with relevance attention, which encourages CAST to comprehend the scene thoroughly and supports decoding more effectively. Finally, the image representation and the memory are integrated in our dual attention decoder, which performs visual conditioning by re-weighting image and text features to alleviate object hallucination. Extensive experiments on the MS COCO and Flickr30k datasets demonstrate that CAST achieves state-of-the-art performance. Our model also performs promisingly in low-resource scenarios (i.e., with 0.1%, 0.5% and 1% of the MS COCO training set).
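The retrieval step can be illustrated with a minimal sketch. The abstract does not specify how the image–text retriever scores sentences, so this assumes that images and sentences are already embedded in a shared vector space and ranked by cosine similarity; the function names `cosine` and `retrieve_top_n` and the toy embeddings are hypothetical and are not taken from CAST.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_u * norm_v)


def retrieve_top_n(image_vec, sentence_vecs, sentences, n):
    """Return the n sentences whose embeddings best match the image embedding.

    Assumption: cosine similarity in a shared embedding space stands in for
    whatever scoring function the actual retriever uses.
    """
    scored = [(cosine(image_vec, vec), idx) for idx, vec in enumerate(sentence_vecs)]
    # Sort by descending similarity; break ties by original index for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [sentences[idx] for _, idx in scored[:n]]


# Toy, hand-made embeddings (hypothetical values, for illustration only).
image_vec = [1.0, 0.0, 0.5]
sentences = [
    "a dog runs on grass",
    "a plate of food",
    "a dog catches a frisbee",
]
sentence_vecs = [
    [0.9, 0.1, 0.4],
    [0.0, 1.0, 0.0],
    [0.8, 0.0, 0.6],
]

# The retrieved sentences play the role of "episodic memory" for the image.
memory = retrieve_top_n(image_vec, sentence_vecs, sentences, n=2)
print(memory)
```

In this sketch the retrieved list would then be passed, together with the image features, to the comprehender and decoder; those stages are not modelled here.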