Abstract

Image captioning requires not only accurate recognition of objects and their relationships, but also full comprehension of the scene. However, existing models suffer from partial understanding and object hallucination. In this paper, a Cross-modal retrievAl and viSual condiTioning model (CAST) is proposed to address these issues for image captioning with three key modules: an image–text retriever, an image & memory comprehender, and a dual attention decoder. Aiming at comprehensive understanding, we propose to exploit cross-modal retrieval to mimic human cognition, i.e., to trigger retrieval of contextual information (called episodic memory) about a specific event. Specifically, the image–text retriever searches for the top-n relevant sentences, which serve as episodic memory for each input image. Then the image & memory comprehender encodes the input image with self-attention and enriches the episodic memory with relevance attention, which encourages CAST to comprehend the scene thoroughly and supports decoding more effectively. Finally, the image representation and enriched memory are integrated into our dual attention decoder, which performs visual conditioning by re-weighting image and text features to alleviate object hallucination. Extensive experiments on the MS COCO and Flickr30k datasets demonstrate that CAST achieves state-of-the-art performance. Our model also achieves promising performance in low-resource scenarios (i.e., 0.1%, 0.5% and 1% of the MS COCO training set).
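
To make the dual-attention and visual-conditioning idea concrete, the following is a minimal sketch (not the authors' released implementation) of one decoding step: the decoder state attends over image region features and over retrieved episodic-memory sentence features separately, and a learned gate re-weights the two contexts so that generation stays grounded in the image. All module names, dimensions, and the exact gating form are illustrative assumptions.

    import torch
    import torch.nn as nn

    class DualAttentionStep(nn.Module):
        def __init__(self, d_model=512, num_heads=8):
            super().__init__()
            # Separate attention over visual features and over retrieved-memory features
            self.img_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
            self.mem_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
            # Visual-conditioning gate: decides how much to trust the visual context
            # versus the textual (memory) context at this step.
            self.gate = nn.Linear(3 * d_model, 1)

        def forward(self, h, img_feats, mem_feats):
            # h:         (B, 1, d)  current decoder hidden state (query)
            # img_feats: (B, R, d)  encoded image region features
            # mem_feats: (B, M, d)  encoded retrieved-sentence (episodic memory) features
            v_ctx, _ = self.img_attn(h, img_feats, img_feats)   # visual context
            t_ctx, _ = self.mem_attn(h, mem_feats, mem_feats)   # memory/text context
            g = torch.sigmoid(self.gate(torch.cat([h, v_ctx, t_ctx], dim=-1)))
            # Re-weight the two modalities; a larger gate value keeps the prediction
            # conditioned on visual evidence, the intuition behind reducing hallucination.
            return g * v_ctx + (1 - g) * t_ctx

    # Toy usage with random features
    step = DualAttentionStep()
    fused = step(torch.randn(2, 1, 512), torch.randn(2, 36, 512), torch.randn(2, 20, 512))
    print(fused.shape)  # torch.Size([2, 1, 512])

The sketch only illustrates the re-weighting mechanism described in the abstract; the retriever and comprehender that produce img_feats and mem_feats are assumed to exist upstream.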
