Abstract

Text-based person search is a challenging cross-modal retrieval task. Existing works reduce the inter-modality and intra-class gaps by aligning local features extracted from the image and text modalities, which easily leads to mismatching problems due to the lack of annotation information. Moreover, it is sub-optimal to reduce the two gaps simultaneously in the same feature space. This work proposes a novel joint token and feature alignment framework to reduce the inter-modality and intra-class gaps progressively. Specifically, we first build a dual-path feature learning network to extract features and conduct feature alignment to reduce the inter-modality gap. Second, we design a text generation module that generates token sequences from visual features, and then perform token alignment to reduce the intra-class gap. Finally, a fusion interaction module is introduced to further eliminate modality heterogeneity through multi-stage feature fusion. Extensive experiments on the CUHK-PEDES dataset demonstrate the effectiveness of our model, which significantly outperforms previous state-of-the-art methods.
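To make the two alignment stages concrete, the following is a minimal PyTorch sketch of the overall idea: a dual-path network produces a visual and a textual feature, a feature-alignment loss reduces the inter-modality gap, and a text generation head predicts caption tokens from the visual feature so a token-alignment loss can act on the intra-class gap. Everything here is an illustrative assumption rather than the authors' implementation: the toy encoders, the symmetric InfoNCE choice for feature alignment, the single-token generation loss, and all dimensions are placeholders.

```python
# Minimal sketch of the described pipeline, NOT the paper's code.
# Backbones, loss choices, and hyperparameters are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualPathModel(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=256):
        super().__init__()
        # Visual path: a toy CNN standing in for the image backbone (assumption).
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, embed_dim),
        )
        # Textual path: embedding + GRU standing in for the text backbone (assumption).
        self.token_embed = nn.Embedding(vocab_size, embed_dim)
        self.text_encoder = nn.GRU(embed_dim, embed_dim, batch_first=True)
        # Text generation module: predicts token logits from the visual feature,
        # enabling a token-level (intra-class) alignment loss.
        self.generator = nn.Linear(embed_dim, vocab_size)

    def forward(self, images, tokens):
        v = F.normalize(self.image_encoder(images), dim=-1)  # visual feature
        _, h = self.text_encoder(self.token_embed(tokens))
        t = F.normalize(h[-1], dim=-1)                       # textual feature
        logits = self.generator(v)                           # generated-token logits
        return v, t, logits

def alignment_losses(v, t, logits, tokens):
    # Feature alignment: symmetric InfoNCE over matched image-text pairs
    # (a common choice; the specific loss used in the paper is not given here).
    sim = v @ t.t() / 0.07
    labels = torch.arange(v.size(0))
    feat_loss = (F.cross_entropy(sim, labels) + F.cross_entropy(sim.t(), labels)) / 2
    # Token alignment: the generated token distribution is scored against the
    # paired caption's first token only, purely to keep the sketch short.
    tok_loss = F.cross_entropy(logits, tokens[:, 0])
    return feat_loss + tok_loss

model = DualPathModel()
images = torch.randn(4, 3, 128, 64)       # person crops (size is an assumption)
tokens = torch.randint(0, 5000, (4, 20))  # tokenized captions
v, t, logits = model(images, tokens)
loss = alignment_losses(v, t, logits, tokens)
loss.backward()
```

In a full implementation one would align the gaps progressively, as the abstract describes: train the feature-alignment objective on the dual-path features first, then apply the token-alignment objective through the generation head, with the multi-stage fusion interaction module combining intermediate features from both paths.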
