Abstract

Text-based person search aims to retrieve relevant person images from a large database given textual queries. However, the single-view limitation of surveillance cameras and cross-modal heterogeneity remain challenging open issues. To address these, we propose the Full-view Salient Feature Mining Network (FLAN) to improve text-image matching in this task. FLAN introduces two key innovations. First, the Diffusion-based Full-view Image Augmentation generates informative full-view data from a single image, simulating human visual observation and enabling the network to learn view-invariant features. Second, the Dual-max Text Attention module optimizes spatial and channel-wise text attentions to extract the most discriminative words characterizing the person. Together, these innovations handle insufficient, imbalanced, and heterogeneous data for more accurate matching. Extensive experiments on three text-based person search datasets, CUHK-PEDES, ICFG-PEDES, and RSTPReid, demonstrate the superior performance of FLAN, with improved robustness and generalization.
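As a rough illustration of the dual-attention idea described above, the sketch below combines a word-level (spatial) attention and a channel-wise attention over a matrix of word features, then max-pools to keep the most discriminative responses. The function name, the choice of logits (L2 norms and per-channel maxima), and the pooling scheme are assumptions for illustration only, not the paper's actual Dual-max Text Attention definition.

```python
import numpy as np

def dual_max_text_attention(word_feats: np.ndarray) -> np.ndarray:
    """Hedged sketch of a dual (spatial + channel) text attention.

    word_feats: array of shape (num_words, dim), one row per word embedding.
    Returns a single (dim,) text embedding emphasizing the most
    discriminative words and channels.
    """
    # Spatial (word-level) attention: softmax over per-word L2 norms,
    # so strongly activated words receive higher weight.
    spatial_logits = np.linalg.norm(word_feats, axis=1)
    spatial_att = np.exp(spatial_logits - spatial_logits.max())
    spatial_att /= spatial_att.sum()

    # Channel attention: softmax over the max activation of each channel
    # across all words (assumed design, not from the paper).
    channel_logits = word_feats.max(axis=0)
    channel_att = np.exp(channel_logits - channel_logits.max())
    channel_att /= channel_att.sum()

    # Apply both attentions, then max-pool over words ("dual-max") to
    # retain the strongest response per feature dimension.
    weighted = word_feats * spatial_att[:, None] * channel_att[None, :]
    return weighted.max(axis=0)

# Usage: five word embeddings of dimension 8 -> one pooled embedding.
feats = np.random.default_rng(0).normal(size=(5, 8))
text_embedding = dual_max_text_attention(feats)
print(text_embedding.shape)  # (8,)
```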
