Heterogeneous Feature Alignment and Fusion in Cross-Modal Augmented Space for Composed Image Retrieval

Huaxin Pang,Shikui Wei,Shiyin Zhang,Shuang Qiu,Yao Zhao,Gangjian Zhang

doi:10.1109/tmm.2022.3208742

Abstract

Composed image retrieval (CIR) aims at fusing a reference image and text feedback to search for the desired images. Compared to general image retrieval, it can model the users' search intent more comprehensively and search the target images more accurately, which has significant impacts in various real-world applications, such as E-commerce and Internet search. However, because of the existing heterogeneous semantic gap, the synthetic understanding and fusion of both image and text are difficult to implement. In this work, to tackle this difficult problem, we propose an end-to-end framework MCR, which uses text and images as retrieval queries. The framework mainly includes four pivotal modules. Specifically, we introduce the Relative Caption-aware Consistency (RCC) constraint to align text pieces and images in the database, which can effectually bridge the heterogeneous gap. The Multi-modal Complementary Fusion (MCF) and Cross-modal Guided Pooling (CGP) are constructed to mine multiple interactions between image local features and text word features and learn the complementary representation of the composed query. Furthermore, we develop a plug-and-play Weak-text Semantic Augment (WSA) module for datasets with short or incomplete query texts, which can supplement the weak-text features and is conducive to modeling an augmented semantic space. Extensive experiments demonstrate the practical superior performance over the existing state-of-the-art empirical algorithms on several benchmarks.

Full Text