Abstract

Text-based person search aims to retrieve relevant person images from a large database given textual queries. However, the single-view limitation of surveillance cameras and cross-modal heterogeneity remain challenging open issues. To address these, we propose the Full-view Salient Feature Mining Network (FLAN) to improve text-image matching in this task. FLAN introduces two key innovations. First, the Diffusion-based Full-view Image Augmentation generates informative full-view data from a single image, simulating human visual observation and enabling the network to learn view-invariant features. Second, the Dual-max Text Attention module optimizes spatial and channel-wise text attentions to extract the most discriminative words characterizing the person. Together, these innovations handle insufficient, imbalanced, and heterogeneous data for more accurate matching. Extensive experiments on three text-based person search datasets, CUHK-PEDES, ICFG-PEDES, and RSTPReid, demonstrate the superior performance of FLAN, with improved robustness and generalization.
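As a rough illustration of the dual-attention idea described above, the sketch below combines a word-level (spatial) attention and a channel-wise attention over a matrix of word features, then max-pools to keep the most discriminative responses. The function name, the choice of logits (L2 norms and per-channel maxima), and the pooling scheme are assumptions for illustration only, not the paper's actual Dual-max Text Attention definition.

```python
import numpy as np

def dual_max_text_attention(word_feats: np.ndarray) -> np.ndarray:
    """Hedged sketch of a dual (spatial + channel) text attention.

    word_feats: array of shape (num_words, dim), one row per word embedding.
    Returns a single (dim,) text embedding emphasizing the most
    discriminative words and channels.
    """
    # Spatial (word-level) attention: softmax over per-word L2 norms,
    # so strongly activated words receive higher weight.
    spatial_logits = np.linalg.norm(word_feats, axis=1)
    spatial_att = np.exp(spatial_logits - spatial_logits.max())
    spatial_att /= spatial_att.sum()

    # Channel attention: softmax over the max activation of each channel
    # across all words (assumed design, not from the paper).
    channel_logits = word_feats.max(axis=0)
    channel_att = np.exp(channel_logits - channel_logits.max())
    channel_att /= channel_att.sum()

    # Apply both attentions, then max-pool over words ("dual-max") to
    # retain the strongest response per feature dimension.
    weighted = word_feats * spatial_att[:, None] * channel_att[None, :]
    return weighted.max(axis=0)

# Usage: five word embeddings of dimension 8 -> one pooled embedding.
feats = np.random.default_rng(0).normal(size=(5, 8))
text_embedding = dual_max_text_attention(feats)
print(text_embedding.shape)  # (8,)
```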
