Abstract

Image-text matching has received considerable interest since it associates different modalities and improves the understanding of images and natural language. It aims to retrieve semantically related images for a given text query, and vice versa. Existing approaches have made substantial progress by projecting images and texts into a common space where data with different semantics can be distinguished. However, they process all data points uniformly, neglecting that data in a neighborhood are harder to distinguish because of their visual or syntactic-structural similarity. To address this issue, we propose a neighbor-aware network for image-text matching, in which an intra-attention module and a neighbor-aware ranking loss jointly distinguish data with different semantics; more importantly, semantically unrelated data in a neighborhood can be distinguished. The intra-attention module attends to discriminative parts by comparing data with different semantics and magnifying the differences between them, especially the subtle differences between data in a neighborhood. The neighbor-aware ranking loss then exploits the magnified differences to explicitly and effectively discriminate data in a neighborhood. We conduct extensive experiments on several benchmarks and show that the proposed approach significantly outperforms the state of the art.
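
As a rough illustration only (not the authors' exact formulation), a neighbor-aware ranking loss can be sketched as a hinge-based bidirectional ranking loss in which the penalized negatives are restricted to each query's nearest neighbors in the common embedding space. The neighborhood size `k`, the margin, and the use of top-k hard negatives as the "neighborhood" are all assumptions for this sketch.

```python
import torch
import torch.nn.functional as F

def neighbor_aware_ranking_loss(img_emb, txt_emb, margin=0.2, k=5):
    """Illustrative sketch: hinge ranking loss that focuses on the k
    hardest (most similar) negatives per query, i.e. its neighborhood.
    Hyperparameters and the exact neighborhood definition are assumed,
    not taken from the paper."""
    # Cosine similarity between every image-text pair in the batch.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    scores = img_emb @ txt_emb.t()               # (B, B)
    pos = scores.diag().unsqueeze(1)             # matched-pair scores, (B, 1)

    # Hinge cost for every negative pair; matched pairs are masked out.
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_txt = (margin + scores - pos).clamp(min=0).masked_fill(mask, 0)      # image -> negative texts
    cost_img = (margin + scores - pos.t()).clamp(min=0).masked_fill(mask, 0)  # text -> negative images

    # Keep only the k hardest negatives per query: the "neighborhood".
    k_eff = min(k, scores.size(0) - 1)
    nbr_txt = cost_txt.topk(k_eff, dim=1).values
    nbr_img = cost_img.topk(k_eff, dim=0).values
    return nbr_txt.mean() + nbr_img.mean()
```

In a training loop this would be called on the batch's image and text embeddings produced by the two encoders; restricting the loss to neighborhood negatives is what makes the penalty concentrate on hard-to-distinguish data, in the spirit of the abstract.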
