Dual-Stream Transformer With Distribution Alignment for Visible-Infrared Person Re-Identification

Zehua Chai, Shaozi Li, Dazhen Lin, Yongguo Ling, Zhiming Luo, Min Jiang

doi:10.1109/tcsvt.2023.3268080

Abstract

Visible-infrared person re-identification (VI-ReID) aims to match person images captured by visible and infrared cameras, and it suffers from severe cross-modality discrepancy and intra-modality variations. Existing approaches mainly use convolutional neural network (CNN)-based architectures to extract pedestrian features, which fail to capture the long-range dependencies within an image. In addition, previous works usually attempt to bridge the modality gap either by using adversarial learning to generate style-consistent images or by designing feature-level metric learning constraints. However, few works consider the cross-modality disparity from the perspective of the overall distance distribution discrepancy. To address these problems, we design a pure Transformer-based Visible-Infrared (TransVI) network with a conventional two-stream structure, which can explicitly capture modality-specific representations and learn knowledge shared across modalities. The multi-head self-attention modules of the transformer let TransVI model the global dependencies that CNN-based architectures miss, capturing the long-range dependencies of pedestrian images. Furthermore, we introduce the Cross-Modality Dissimilarity-based Maximum Mean Discrepancy (CMD-MMD) constraint to handle the cross-modality discrepancy at the distance distribution level. Specifically, CMD-MMD leverages intra-modality distribution separability to guide inter-modality distribution separability learning: it aligns the pair-wise distance distributions of intra- and inter-modality pairs, separately for within-class and between-class pairs. In this way, the intra- and inter-modality distance distributions become more similar, which significantly mitigates the cross-modality discrepancy and yields more modality-invariant representations. Extensive experimental results on two public VI-ReID datasets confirm that our proposed framework can achieve state-of-the-art performance.
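
To make the idea of aligning distance distributions concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes Euclidean pair-wise distances, an RBF kernel with a fixed bandwidth `sigma`, and a biased MMD estimator; none of these choices are stated in the abstract. The function names (`pairwise_dist`, `rbf_mmd2`, `distance_distribution_gap`) are hypothetical. The sketch compares intra-modality (visible-visible) distances with inter-modality (visible-infrared) distances, once for same-identity pairs and once for different-identity pairs.

```python
import numpy as np

def pairwise_dist(a, b):
    """Euclidean distance between every row of a and every row of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def rbf_mmd2(x, y, sigma=1.0):
    """Biased squared-MMD estimate between two 1-D samples (RBF kernel)."""
    def k(u, v):
        return np.exp(-((u[:, None] - v[None, :]) ** 2) / (2.0 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()

def distance_distribution_gap(feat_v, feat_i, ids_v, ids_i, sigma=1.0):
    """Assumed formulation: MMD between intra- and inter-modality distance
    samples, summed over within-class and between-class pairs.
    Assumes every identity has at least two visible samples in the batch."""
    d_intra = pairwise_dist(feat_v, feat_v)   # visible vs. visible
    d_inter = pairwise_dist(feat_v, feat_i)   # visible vs. infrared
    same_intra = ids_v[:, None] == ids_v[None, :]
    same_inter = ids_v[:, None] == ids_i[None, :]
    off_diag = ~np.eye(len(ids_v), dtype=bool)  # drop zero self-distances
    within = rbf_mmd2(d_intra[same_intra & off_diag], d_inter[same_inter], sigma)
    between = rbf_mmd2(d_intra[~same_intra], d_inter[~same_inter], sigma)
    return within + between

# Toy batch: 2 identities, 2 images each, 8-D features per modality.
rng = np.random.default_rng(0)
ids = np.array([0, 0, 1, 1])
feat_v = rng.normal(size=(4, 8))
feat_i = feat_v + 3.0  # crude stand-in for a modality shift
gap = distance_distribution_gap(feat_v, feat_i, ids, ids)
```

In a training setting, a quantity like `gap` would be computed on network features in an autodiff framework and minimized. Whether and how the intra-modality side is held fixed as the guiding target is a detail of the paper that this sketch does not attempt to reproduce.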

Full Text