Visible-infrared person re-identification (VI-ReID) is a challenging cross-modality retrieval task to match a person across different spectral camera views. Most existing works focus on learning shared feature representations from the final embedding space of advanced networks to alleviate modality differences between visible and infrared images. However, exclusively relying on high-level semantic information from the network's final layers can restrict shared feature representations and overlook the benefits of low-level details. Different from these methods, we propose a multi-scale contrastive learning network (MCLNet) with hierarchical knowledge synergy for VI-ReID. MCLNet is a novel two-stream contrastive deep supervision framework designed to train low-level details and high-level semantic representations simultaneously. MCLNet utilizes supervised contrastive learning (SCL) at each intermediate layer to strengthen visual representations and enhance cross-modality feature learning. Furthermore, a hierarchical knowledge synergy (HKS) strategy for pairwise knowledge matching promotes explicit information interaction across multi-scale features and improves information consistency. Extensive experiments on three benchmarks demonstrate the effectiveness of MCLNet.
Read full abstract