Abstract

Person Re-identification (ReID) is a computer vision task of retrieving a person of interest across multiple non-overlapping cameras. Most of the existing methods are developed based on convolutional neural networks trained with supervision, which may suffer from the problem of missing subtle visual cues and global information caused by pooling operations and the weight-sharing mechanism. To this end, we propose a novel Transformer-based ReID method by pre-training with Masked Autoencoders and fine-tuning with locality enhancement. In our method, Masked Autoencoders are pre-trained in a self-supervised way by using large-scale unlabeled data such that subtle visual cues can be learned with the pixel-level reconstruction loss. With the Transformer backbone, global features are extracted as the classification token which integrates information from different patches by the self-attention mechanism. To take full advantage of local information, patch tokens are reshaped into a feature map for convolution to extract local features in the proposed locality enhancement module. Both global and local features are combined to obtain more robust representations of person images for inference. To the best of our knowledge, this is the first work to utilize generative self-supervised learning for representation learning in ReID. Experiments show that the proposed method achieves competitive performance in terms of parameter scale, computation overhead, and ReID performance compared with the state of the art. Code is available at https://github.com/YanzuoLu/MALE.

Keywords: Masked Autoencoders, Transformer, Self-supervised learning, Person re-identification
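
The locality enhancement idea described in the abstract (reshape patch tokens into a feature map, convolve, then combine the result with the classification token) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the kernel size, the pooling step, or how the two features are combined, so the 3x3 kernel shared across channels, the global average pooling, and the concatenation below are all assumptions, as are the function names. Pure Python is used only so that the sketch runs without dependencies; a fixed kernel stands in for learned convolution weights.

```python
from typing import List

Vector = List[float]
Grid = List[List[Vector]]  # indexed as grid[row][col][channel]


def tokens_to_grid(patch_tokens: List[Vector], height: int, width: int) -> Grid:
    """Reshape a flat, row-major sequence of H*W patch tokens into an H x W grid."""
    if len(patch_tokens) != height * width:
        raise ValueError("expected height * width patch tokens")
    return [patch_tokens[r * width:(r + 1) * width] for r in range(height)]


def conv3x3_shared(grid: Grid, kernel: List[List[float]]) -> Grid:
    """Apply one 3x3 kernel to every channel with zero padding (output stays H x W).

    Assumption: a single fixed kernel replaces the learned weights of a real module.
    """
    height, width, dim = len(grid), len(grid[0]), len(grid[0][0])
    out = [[[0.0] * dim for _ in range(width)] for _ in range(height)]
    for r in range(height):
        for c in range(width):
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < height and 0 <= cc < width:
                        weight = kernel[dr + 1][dc + 1]
                        for d in range(dim):
                            out[r][c][d] += weight * grid[rr][cc][d]
    return out


def global_average_pool(grid: Grid) -> Vector:
    """Average each channel over all spatial positions."""
    height, width, dim = len(grid), len(grid[0]), len(grid[0][0])
    count = height * width
    return [
        sum(grid[r][c][d] for r in range(height) for c in range(width)) / count
        for d in range(dim)
    ]


def combine_features(cls_token: Vector, patch_tokens: List[Vector],
                     height: int, width: int,
                     kernel: List[List[float]]) -> Vector:
    """Concatenate the global (classification token) and pooled local features.

    Assumption: concatenation is one plausible way to combine the two features.
    """
    grid = tokens_to_grid(patch_tokens, height, width)
    local = global_average_pool(conv3x3_shared(grid, kernel))
    return list(cls_token) + local


if __name__ == "__main__":
    # Toy input: a 2 x 2 grid of 2-dimensional patch tokens plus a classification token.
    tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    cls = [0.5, -0.5]
    identity = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
    print(combine_features(cls, tokens, 2, 2, identity))  # [0.5, -0.5, 4.0, 5.0]
```

With the identity kernel the convolution leaves the feature map unchanged, so the local part reduces to the mean of the patch tokens; a learned kernel would instead mix each token with its spatial neighbours, which is the locality that self-attention alone does not enforce.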

