Abstract

Occluded person re-identification (Re-ID), which aims to match images of the same person across cameras under occlusion, remains challenging due to incomplete information and spatial misalignment. State-of-the-art (SOTA) methods typically adopt a two-stage architecture that relies on existing pose estimation models or attention mechanisms to generate human masks for feature extraction, which complicates the model and introduces additional biases. To address this issue, we propose a novel end-to-end transformer-based occluded person Re-ID model. Specifically, our model contains two crucial components: (1) the features of the global image and of non-occluded person regions are extracted by two independent transformer-based feature extraction networks; (2) the distribution of shared non-occluded human regions is learned via a multi-head self-attention mechanism, after which the Minimized Character-box Proposal (MCP) generates accurate crops of those shared regions. Non-occluded human regions are not annotated; instead, their distribution is learned jointly from the weak supervision of ID labels combined with multi-head self-attention. In addition, the dual-branch architecture yields human features that carry multi-scale information. Extensive experiments on four person Re-ID benchmarks covering two tasks (occluded and partial) demonstrate the effectiveness of the proposed framework, which achieves SOTA or comparable performance on all benchmarks.
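To make the described architecture concrete, the following is a minimal PyTorch sketch of a dual-branch design of this kind. It is an illustrative approximation, not the authors' implementation: the class name, dimensions, and the shared patch stem are hypothetical, and the MCP cropping step is approximated by simply keeping the top-k patches that receive the most self-attention.

```python
import torch
import torch.nn as nn

class DualBranchReIDSketch(nn.Module):
    """Hypothetical sketch: a global transformer branch, a second branch
    over selected non-occluded patches, and multi-head self-attention
    whose weights score patch visibility (a crude stand-in for MCP)."""

    def __init__(self, dim=768, heads=8, depth=4, num_ids=751):
        super().__init__()
        def encoder():
            return nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True),
                num_layers=depth)
        self.global_branch = encoder()   # encodes all patch tokens
        self.region_branch = encoder()   # encodes selected visible patches
        self.region_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # weak supervision: only an ID classification head, no region labels
        self.id_head = nn.Linear(2 * dim, num_ids)

    def forward(self, tokens, k=64):
        # tokens: (B, N, dim) patch embeddings from a shared stem (assumed)
        g = self.global_branch(tokens).mean(dim=1)          # global feature
        _, attn = self.region_attn(tokens, tokens, tokens)  # attn: (B, N, N)
        score = attn.mean(dim=1)                            # attention each patch receives
        idx = score.topk(k, dim=1).indices                  # keep top-k visible patches
        picked = torch.gather(
            tokens, 1, idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))
        r = self.region_branch(picked).mean(dim=1)          # non-occluded feature
        feat = torch.cat([g, r], dim=-1)                    # fused multi-scale feature
        return feat, self.id_head(feat)
```

Under this reading, the only training signal is the ID classification loss on `id_head`, so the attention-based patch selection is shaped indirectly by identity labels, matching the weakly supervised setup the abstract describes.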
