Abstract

Person search is a computer vision task that aims to localize and re-identify specific pedestrians in images captured by non-overlapping cameras. However, identity annotation for person search is labor-intensive, especially as the amount of data grows. Consequently, a growing number of studies train person search models with weakly supervised learning using only location annotations. In the absence of identity supervision, context information is useful for improving feature representations. Existing weakly supervised person search methods focus on logic-driven contexts while ignoring feature contexts. In this paper, we propose a hybrid deep network for weakly supervised person search. The hybrid architecture consists of a Transformer-based feature extraction network and a fully convolutional region recognition head, which enables the model to learn feature contexts at different levels. In our network, hierarchical vision Transformers extract features from scene images to obtain discriminative representations, and a context-enhanced head network integrates different features for each candidate pedestrian. In addition, a pedestrian proposal network is introduced to improve the quality of predicted proposals. Experiments on the CUHK-SYSU and PRW benchmarks evaluate the effectiveness of the proposed method.
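The pipeline the abstract describes can be pictured as three stages: a hierarchical backbone that produces multi-scale scene features, a proposal network that predicts candidate pedestrian boxes, and a head that fuses per-box features with scene context into re-identification embeddings. The following is a minimal NumPy sketch of that data flow only; all function names, shapes, and the fixed projection are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def backbone(image, stages=3):
    """Toy hierarchical backbone: each stage halves resolution and doubles
    channels (a stand-in for the attention blocks of a hierarchical ViT)."""
    h, w, c = image.shape
    feats = []
    for _ in range(stages):
        h, w, c = h // 2, w // 2, c * 2
        feats.append(np.zeros((h, w, c)))  # placeholder feature map
    return feats

def proposal_network(feat, num_proposals=5):
    """Stand-in proposal network: emit (x1, y1, x2, y2) boxes on the grid."""
    h, w, _ = feat.shape
    rng = np.random.default_rng(0)
    x1 = rng.integers(0, w - 2, num_proposals)
    y1 = rng.integers(0, h - 2, num_proposals)
    return np.stack([x1, y1, x1 + 2, y1 + 2], axis=1)

def context_head(feat, boxes, dim=32):
    """Stand-in head: average-pool each box, add global scene context,
    and project to a re-ID embedding (the 'context-enhanced' step)."""
    scene_ctx = feat.mean(axis=(0, 1))                    # global context
    proj = np.ones((feat.shape[2], dim)) / feat.shape[2]  # fixed projection
    embs = []
    for x1, y1, x2, y2 in boxes:
        roi = feat[y1:y2, x1:x2].mean(axis=(0, 1))        # pooled box feature
        embs.append((roi + scene_ctx) @ proj)             # context-enhanced
    return np.stack(embs)

image = np.zeros((64, 64, 4))
feats = backbone(image)
boxes = proposal_network(feats[-1])
embeddings = context_head(feats[-1], boxes)
print(embeddings.shape)  # one embedding per candidate pedestrian
```

In a real system the pooled-box step would be RoI alignment on learned features and the projection a trained layer; the sketch only fixes the stage boundaries that the abstract names.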
