Abstract

Occluded person Re-Identification (Re-ID) aims to retrieve a target person across camera views in scenes where the person is partially occluded. Because occlusion introduces interference from other objects and causes the loss of body information, extracting effective person feature representations is crucial to the recognition accuracy of the system. Most existing methods address this problem by designing various deep networks, commonly referred to as convolutional neural network (CNN)-based methods. Although such methods are powerful at mining local features, they may fail to capture global information due to the Gaussian-distributed receptive field of the convolution operation. Recently, Vision Transformer (ViT)-based methods have been successfully applied to the person Re-ID task and achieve good performance. However, since ViT-based methods are weaker at extracting local information from person images, their representations may severely lose local details. To address these deficiencies, we design a convolution and self-attention aggregation network (CSNet) that combines the advantages of both CNN and ViT. The proposed CSNet consists of three parts. First, to better capture person information, a Dual-Branch Encoder (DBE) encodes person images. Second, a Local Information Aggregation Module (LIAM) is embedded on the feature map to effectively exploit the useful information in local features. Finally, a Multi-Head Global-to-Local Attention (MHGLA) module transmits global information to local features. Experimental results demonstrate the superiority of the proposed method over state-of-the-art (SOTA) methods on both occluded and holistic person Re-ID datasets.
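To make the global-to-local idea concrete, the following is a minimal, hypothetical PyTorch sketch (not the authors' released code) of one way a multi-head global-to-local attention step could be realized: local tokens from a CNN/local branch query a single global token (e.g., a ViT class token) so that each local feature is refined with image-level context. The class name GlobalToLocalAttention and the parameters d_model and n_heads are illustrative assumptions.

import torch
import torch.nn as nn

class GlobalToLocalAttention(nn.Module):
    """Cross-attention in which local tokens attend to a global token,
    injecting global context into local features (one possible reading
    of the MHGLA module described in the abstract)."""
    def __init__(self, d_model: int = 768, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, local_tokens: torch.Tensor, global_token: torch.Tensor) -> torch.Tensor:
        # local_tokens: (B, N, d_model) from the local (CNN) branch
        # global_token: (B, 1, d_model), e.g. a ViT class token
        out, _ = self.attn(query=local_tokens, key=global_token, value=global_token)
        return self.norm(local_tokens + out)  # residual fusion of global context

if __name__ == "__main__":
    local = torch.randn(2, 16, 768)   # 16 local part tokens per image
    glob = torch.randn(2, 1, 768)     # one global token per image
    refined = GlobalToLocalAttention()(local, glob)
    print(refined.shape)              # torch.Size([2, 16, 768])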
