Image and text matching bridges visual and textual modality differences and plays a considerable role in cross-modal retrieval. Much progress has been achieved through semantic representation and alignment. However, the distribution of multimedia data is severely unbalanced and contains many low-frequency occurrences, which are often ignored and cause performance degradation, i.e., the long-tail effect. In this work, we propose a novel rare-aware attention network (RAAN), which explores and exploits textual rare content for tackling the long-tail effect of image and text matching. Specifically, we first design a rare-aware mining module, which contains global prior information construction and rare fragment detector for modeling the characteristic of rare content. Then, the rare attention matching utilizes prior information as attention to guide the representation enhancement of rare content and introduces the rareness representation to strengthen the similarity calculation. Finally, we design prior information loss to optimize the model together with the triplet loss. We perform quantitative and qualitative experiments on two large-scale databases and achieve leading performance. In particular, we conduct 0-shot test for rare content and improve rSum by 21.0 and 41.5 on Flickr30K (155,000 image and text pairs) and MSCOCO (616,435 image and text pairs), demonstrating the effectiveness of the proposed method for the long-tail effect.
Read full abstract