Abstract

Visible-infrared person re-identification (VI Re-ID) aims to match person images of the same identity captured by visible and infrared cameras. Transformer architectures have been successfully applied to VI Re-ID. However, previous Transformer-based methods were mainly designed to capture global content information within a single modality and could not simultaneously perceive semantic information across the two modalities from a global perspective. To solve this problem, we propose a novel framework named the cross-modality interaction Transformer (CMIT). It models spatial and sequential features to capture long-range dependencies, and it explicitly improves the discriminativeness of features by exchanging information across modalities, which helps obtain modality-invariant representations. Specifically, CMIT utilizes a cross-modality attention mechanism to enrich the feature representation of each patch token by interacting with the patch tokens of the other modality, and it aggregates the local features of the CNN structure with the global information of the Transformer structure to mine salient feature representations. Furthermore, a modality-discriminative (MD) loss is proposed to learn the latent consistency between modalities, encouraging intra-modality compactness within each class and inter-modality separation between classes. Extensive experiments on two benchmarks demonstrate that our approach outperforms state-of-the-art methods.
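
The abstract does not give implementation details; the following minimal PyTorch sketch illustrates one plausible form of the cross-modality attention described above, in which the patch tokens of one modality act as queries over the patch tokens of the other modality. Module names, tensor shapes, and the shared-layer choice are assumptions for illustration, not the paper's actual design.

import torch
import torch.nn as nn

class CrossModalityAttention(nn.Module):
    # Hypothetical sketch: tokens of one modality (queries) attend to the
    # patch tokens of the other modality (keys/values), so each token's
    # representation is enriched with cross-modal context.
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        # batch_first=True makes the layer accept (batch, tokens, channels)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens_a, tokens_b):
        # tokens_a: patch tokens of modality A, shape (B, N, C)
        # tokens_b: patch tokens of modality B, shape (B, M, C)
        # A residual connection preserves the original single-modality content.
        attended, _ = self.attn(query=tokens_a, key=tokens_b, value=tokens_b)
        return self.norm(tokens_a + attended)

if __name__ == "__main__":
    layer = CrossModalityAttention(dim=768, num_heads=8)
    visible = torch.randn(2, 196, 768)   # visible-modality patch tokens
    infrared = torch.randn(2, 196, 768)  # infrared-modality patch tokens
    enriched_vis = layer(visible, infrared)   # visible queries infrared
    enriched_ir = layer(infrared, visible)    # infrared queries visible
    print(enriched_vis.shape, enriched_ir.shape)

Applying the same layer in both directions, as above, is only one option; whether CMIT shares parameters across the two attention directions is not stated in the abstract.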
