Abstract

Composed image retrieval aims to retrieve images given a query composed of a reference image and a complementary text piece. Because combining image and text information models the user's search intent more precisely, composed image retrieval supports target-specific retrieval and has potential applications such as interactive product search. However, two key challenges must be addressed. The first is how to fuse the heterogeneous image and text piece in the query into a complementary feature space. The second is how to bridge the heterogeneous gap between text pieces in the query and images in the database. To address these issues, we propose an end-to-end framework for composed image retrieval that consists of three key components: Multi-modal Complementary Fusion (MCF), Cross-modal Guided Pooling (CGP), and Relative Caption-aware Consistency (RCC). The MCF and CGP modules fully integrate the complementary information of the image and text piece in the query through multiple deep interactions and aggregate the resulting local features into a single embedding vector. To bridge the heterogeneous gap, we introduce the RCC constraint to align text pieces in the query with images in the database. Extensive experiments on four public benchmark datasets show that the proposed framework outperforms state-of-the-art methods.
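To make the described pipeline concrete, the following is a minimal PyTorch sketch of such a framework under assumed forms of the modules: MCF is approximated here as cross-attention fusion, CGP as text-guided attention pooling, and RCC as a contrastive alignment loss. All class, function, and parameter names below are hypothetical illustrations and are not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ComposedRetrievalSketch(nn.Module):
    """Illustrative sketch of a fusion-then-pooling composed-retrieval query encoder."""

    def __init__(self, img_dim=512, txt_dim=512, embed_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, embed_dim)
        self.txt_proj = nn.Linear(txt_dim, embed_dim)
        # Stand-in for MCF: cross-attention between image regions and text tokens.
        self.fusion = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)
        # Stand-in for CGP: learned weights for pooling fused local features.
        self.pool_gate = nn.Linear(embed_dim, 1)

    def forward(self, img_locals, txt_tokens):
        # img_locals: (B, R, img_dim) local region features of the reference image
        # txt_tokens: (B, T, txt_dim) token features of the complementary text piece
        img = self.img_proj(img_locals)
        txt = self.txt_proj(txt_tokens)
        # Fusion step: image regions attend to the text to form complementary features.
        fused, _ = self.fusion(query=img, key=txt, value=txt)
        fused = fused + img  # residual connection
        # Pooling step: aggregate local fused features into one query embedding.
        weights = torch.softmax(self.pool_gate(fused), dim=1)  # (B, R, 1)
        query_embed = (weights * fused).sum(dim=1)
        return F.normalize(query_embed, dim=-1)


def caption_consistency_loss(txt_embed, target_img_embed, temperature=0.07):
    """Stand-in for RCC: InfoNCE-style alignment between text embeddings and
    target-image embeddings, so matching pairs score higher than mismatched ones."""
    logits = txt_embed @ target_img_embed.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```

In this sketch the query embedding would be matched against database image embeddings with cosine similarity, while the consistency loss provides the additional text-to-image alignment signal described above; the actual module designs in the paper may differ.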
