Traditional methods that use pose information for person re-identification (ReID) often focus only on the relationships among parts of the same sample and ignore the correlation between the poses of different samples, which hinders further improvement in recognition accuracy. To address this, this paper proposes a pose-guided self- and external-attention feature matching and aggregation network, which contains a visual-context self- and external-attention module and a pose-guided feature matching and aggregation module. The network divides global features into local features without requiring strict spatial feature alignment and enables the model to focus on pose correlations between different samples, so that it not only learns pose-related identity-discriminative features but also avoids interference from samples with large pose changes. The visual-context self- and external-attention module performs feature embedding with the transformer-based image classification model ViT and a human pose estimation model; by adding an external attention mechanism, it extracts encoder features and local features that contain rich pose information. The pose-guided feature matching and aggregation module obtains a learnable part-semantic view through a transformer decoder and pose heat maps, then matches and aggregates this view with the local features, adaptively learning pose-related features and enhancing the robustness of the model to background interference and occlusion. Experiments are conducted on three datasets: Market-1501, DukeMTMC-reID, and MSMT17. The Rank-1 accuracies are 95.3%, 90.1%, and 82.5%, respectively, and the mAP values are 88.3%, 81.1%, and 64.1%, respectively. In conclusion, our method introduces external attention into the ReID task, and the ReID model obtains more accurate results by extracting pose-related identity-discriminative features and avoiding interference from samples with large pose changes.
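As a concrete illustration of the external attention idea the abstract refers to, the sketch below follows the standard two-linear-layer formulation of external attention, in which attention is computed against learnable memories shared across all samples rather than within a single sample. It is a minimal sketch only: the class name `ExternalAttention`, the memory size `num_units`, and the ViT-Base token shape in the usage lines are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class ExternalAttention(nn.Module):
    """Minimal external attention sketch: tokens attend to two small
    learnable external memories (M_k, M_v) shared across the whole
    dataset, so cross-sample correlations can be captured."""

    def __init__(self, dim: int, num_units: int = 64):
        super().__init__()
        self.mk = nn.Linear(dim, num_units, bias=False)  # external key memory M_k
        self.mv = nn.Linear(num_units, dim, bias=False)  # external value memory M_v

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim), e.g. ViT encoder tokens
        attn = self.mk(x)                                     # (B, N, S) attention map
        attn = attn.softmax(dim=1)                            # normalize over tokens
        attn = attn / (attn.sum(dim=2, keepdim=True) + 1e-9)  # double normalization
        return self.mv(attn)                                  # (B, N, dim) refined tokens

# Usage sketch: refine a hypothetical batch of ViT-Base tokens
tokens = torch.randn(8, 197, 768)
refined = ExternalAttention(dim=768)(tokens)
print(refined.shape)  # torch.Size([8, 197, 768])
```

Because the key and value memories are parameters of the model rather than projections of the current sample, they act as a dataset-level memory, which is the property the abstract exploits to model pose correlations between different samples.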