Abstract

Text-based person search is an important task in video surveillance, which aims to retrieve the corresponding pedestrian images given a textual description. In this fine-grained retrieval task, accurate cross-modal information matching is essential yet challenging. However, existing methods usually ignore the information inequality between modalities, which can make cross-modal matching considerably more difficult. Specifically, in this task, the images inevitably contain pedestrian-irrelevant noise such as background and occlusion, while the descriptions may be biased toward only part of the pedestrian content in the images. With that in mind, in this paper we propose a Text-Guided Denoising and Alignment (TGDA) model to alleviate the information inequality and realize effective cross-modal matching. In TGDA, we first design a prototype-based denoising module, which integrates pedestrian knowledge from textual features into a prototype vector and uses it as guidance to filter pedestrian-irrelevant noise out of the visual features. Thereafter, a bias-aware alignment module is introduced, which guides our model to consistently focus on the description-biased pedestrian content in the cross-modal features. Extensive experiments validate the effectiveness of both modules, and TGDA achieves state-of-the-art performance on various related benchmarks.
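To make the prototype-based denoising idea concrete, the following is a minimal sketch of one plausible realization (our own illustration, not the authors' implementation): token-level textual features are pooled into a single prototype vector, which then gates patch-level visual features by similarity so that pedestrian-irrelevant patches are suppressed. The function name, feature shapes, and the pooling and gating choices are all assumptions.

```python
# Hypothetical sketch of prototype-guided denoising (not the released TGDA code).
import torch
import torch.nn.functional as F

def prototype_denoise(text_tokens, visual_patches):
    """
    text_tokens:    (B, Lt, D) token-level textual features
    visual_patches: (B, Lv, D) patch-level visual features
    Returns denoised visual features of shape (B, Lv, D).
    """
    # Aggregate textual features into one pedestrian prototype per sample
    # (mean pooling is an assumption; any learned aggregation could be used).
    prototype = text_tokens.mean(dim=1, keepdim=True)            # (B, 1, D)

    # Score each visual patch by cosine similarity to the prototype.
    scores = torch.matmul(
        F.normalize(visual_patches, dim=-1),
        F.normalize(prototype, dim=-1).transpose(1, 2),
    )                                                            # (B, Lv, 1)

    # Turn similarities into soft gates and suppress low-scoring patches,
    # e.g. background or occlusion regions unrelated to the pedestrian.
    weights = torch.sigmoid(scores)
    return visual_patches * weights
```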
