Abstract

Text-based person search is a challenging cross-modal retrieval task. Existing works reduce the inter-modality and intra-class gaps by aligning local features extracted from the image and text modalities, which easily leads to mismatching problems due to the lack of annotation information. Moreover, it is sub-optimal to reduce the two gaps simultaneously in the same feature space. This work proposes a novel joint token and feature alignment framework to reduce the inter-modality and intra-class gaps progressively. Specifically, we first build a dual-path feature learning network to extract features and conduct feature alignment to reduce the inter-modality gap. Second, we design a text generation module that generates token sequences from visual features, and then perform token alignment to reduce the intra-class gap. Finally, a fusion interaction module is introduced to further eliminate modality heterogeneity through multi-stage feature fusion. Extensive experiments on the CUHK-PEDES dataset demonstrate the effectiveness of our model, which significantly outperforms previous state-of-the-art methods.
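To make the two alignment stages concrete, the following is a minimal PyTorch sketch of the overall idea: a dual-path network produces a visual and a textual feature, a feature-alignment loss reduces the inter-modality gap, and a text generation head predicts caption tokens from the visual feature so a token-alignment loss can act on the intra-class gap. Everything here is an illustrative assumption rather than the authors' implementation: the toy encoders, the symmetric InfoNCE choice for feature alignment, the single-token generation loss, and all dimensions are placeholders.

```python
# Minimal sketch of the described pipeline, NOT the paper's code.
# Backbones, loss choices, and hyperparameters are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualPathModel(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=256):
        super().__init__()
        # Visual path: a toy CNN standing in for the image backbone (assumption).
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, embed_dim),
        )
        # Textual path: embedding + GRU standing in for the text backbone (assumption).
        self.token_embed = nn.Embedding(vocab_size, embed_dim)
        self.text_encoder = nn.GRU(embed_dim, embed_dim, batch_first=True)
        # Text generation module: predicts token logits from the visual feature,
        # enabling a token-level (intra-class) alignment loss.
        self.generator = nn.Linear(embed_dim, vocab_size)

    def forward(self, images, tokens):
        v = F.normalize(self.image_encoder(images), dim=-1)  # visual feature
        _, h = self.text_encoder(self.token_embed(tokens))
        t = F.normalize(h[-1], dim=-1)                       # textual feature
        logits = self.generator(v)                           # generated-token logits
        return v, t, logits

def alignment_losses(v, t, logits, tokens):
    # Feature alignment: symmetric InfoNCE over matched image-text pairs
    # (a common choice; the specific loss used in the paper is not given here).
    sim = v @ t.t() / 0.07
    labels = torch.arange(v.size(0))
    feat_loss = (F.cross_entropy(sim, labels) + F.cross_entropy(sim.t(), labels)) / 2
    # Token alignment: the generated token distribution is scored against the
    # paired caption's first token only, purely to keep the sketch short.
    tok_loss = F.cross_entropy(logits, tokens[:, 0])
    return feat_loss + tok_loss

model = DualPathModel()
images = torch.randn(4, 3, 128, 64)       # person crops (size is an assumption)
tokens = torch.randint(0, 5000, (4, 20))  # tokenized captions
v, t, logits = model(images, tokens)
loss = alignment_losses(v, t, logits, tokens)
loss.backward()
```

In a full implementation one would align the gaps progressively, as the abstract describes: train the feature-alignment objective on the dual-path features first, then apply the token-alignment objective through the generation head, with the multi-stage fusion interaction module combining intermediate features from both paths.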
