Existing person re-identification methods built on CLIP (Contrastive Language-Image Pre-training) mostly suffer from coarse-grained alignment. This stems from CLIP's original design, which targets broad, global alignment between images and texts to support a wide range of image–text matching tasks. In person re-identification, however, local, fine-grained information matters as much as global features. This paper proposes an innovative modal fusion approach that precisely locates the most prominent pedestrian information in an image by combining visual features extracted by a ResNet-50 backbone with text representations produced by a text encoder. The method leverages the cross-attention mechanism of a Transformer Decoder so that text features dynamically guide visual features, strengthening the ability to identify and localize the target pedestrian. Experiments on four public datasets, MSMT17, Market1501, DukeMTMC, and Occluded-Duke, show that our method outperforms the baseline network by 5.4%, 2.7%, 2.6%, and 9.2% in mAP, and by 4.3%, 1.7%, 2.7%, and 11.8% in Rank-1, respectively. These results demonstrate strong performance and offer new research insights for the task of person re-identification.
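The fusion step described above can be illustrated with a minimal PyTorch sketch: text features act as decoder queries and cross-attend to flattened ResNet-50 visual tokens. The module name, embedding dimension, and layer counts below are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class TextGuidedFusion(nn.Module):
    """Minimal sketch of text-guided cross-attention fusion (hypothetical, not the paper's code).

    Visual features (e.g., a ResNet-50 feature map flattened into a token sequence)
    are attended by text features through standard Transformer decoder layers,
    letting the text representation re-weight the most discriminative pedestrian regions.
    """

    def __init__(self, dim: int = 512, num_heads: int = 8, num_layers: int = 2):
        super().__init__()
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True
        )
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=num_layers)

    def forward(self, text_feats: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
        # text_feats:   (B, T, dim)   text tokens serve as decoder queries
        # visual_feats: (B, H*W, dim) flattened visual tokens serve as keys/values
        # Cross-attention inside the decoder lets text dynamically guide visual features.
        return self.decoder(tgt=text_feats, memory=visual_feats)


if __name__ == "__main__":
    fusion = TextGuidedFusion()
    text = torch.randn(4, 1, 512)         # e.g., one pooled text embedding per prompt
    visual = torch.randn(4, 12 * 4, 512)   # e.g., a 12x4 feature map projected to 512-d
    out = fusion(text, visual)
    print(out.shape)                       # torch.Size([4, 1, 512])
```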