Abstract

Person re-identification (ReID) is the problem of retrieving a target person across cameras. Extracting robust, discriminative features is the key to correctly associating targets. Models based on convolutional neural networks (CNNs) can extract robust image features, but they progress from local to global information only by stacking successive convolution layers. Compared with a complex CNN, a vision transformer (ViT) captures global information from the first layer and can therefore extract more powerful features. This paper proposes ViTReID, an unsupervised domain-adaptive person re-identification model based on the vision transformer. It uses a ViT trained on ImageNet as the pre-trained weights and a transformer encoder as the feature extraction network, which compensates for some shortcomings of CNN models. The network is optimized with a combined loss consisting of cross-entropy loss, triplet loss, and center loss. In addition, the person’s head is trained and evaluated as a local feature and combined with the global whole-body feature to strengthen the head feature information. Experimental results show that ViTReID exceeds the baseline method (SSG) by 14% in mean average precision (mAP) on Market1501 → MSMT17. On MSMT17 → Market1501, ViTReID is 1.2% higher in rank-1 (R1) accuracy than a state-of-the-art method (SPCL); on PersonX → MSMT17, its mAP is 3.1% higher than that of the MMT-dbscan method, and on PersonX → Market1501, its mAP is 1.5% higher than that of the MMT-dbscan method.
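The combined objective named in the abstract (cross-entropy, triplet, and center loss) can be illustrated with a minimal per-sample sketch in plain Python. The abstract does not give the loss weights, the margin, or the distance metric, so the function names, the Euclidean distance, `margin=0.3`, and `center_weight=0.0005` below are all illustrative assumptions, not the authors' settings.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample (numerically stable log-sum-exp)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def euclidean(a, b):
    """Euclidean distance between two feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.3):
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def center_loss(feature, center):
    """Half the squared distance between a feature and its class center."""
    return 0.5 * sum((x - c) ** 2 for x, c in zip(feature, center))


def combined_loss(logits, label, anchor, positive, negative, center,
                  margin=0.3, center_weight=0.0005):
    """Sum of the three terms; margin and center_weight are assumed values."""
    return (cross_entropy(logits, label)
            + triplet_loss(anchor, positive, negative, margin)
            + center_weight * center_loss(anchor, center))


if __name__ == "__main__":
    # Toy 2-D features and 2-class logits, purely for illustration.
    loss = combined_loss(
        logits=[2.0, 0.5], label=0,
        anchor=[1.0, 0.0], positive=[0.9, 0.1], negative=[-1.0, 0.0],
        center=[1.0, 0.1],
    )
    print(f"combined loss: {loss:.4f}")
```

In a real training loop these terms would be computed over mini-batches of encoder features; in an unsupervised setting the `label` would presumably be a pseudo-label rather than a ground-truth identity, which is also an assumption here.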
