Person re-identification (reID) is a challenging problem that can benefit from complementary multi-modal information, which, once mapped into a shared space, enables 24-hour surveillance systems. However, current visible–infrared cross-modal person reID methods concentrate primarily on image-to-image matching, while image-to-video and video-to-video matching, which offer rich spatial–temporal representations, remain largely unexplored. Existing cross-modal reID methods rely on score fusion or feature integration to merge heterogeneous and complementary modalities; unfortunately, these strategies fall short of fully exploiting the complementary information that the different modalities offer. To overcome these drawbacks, this study proposes a Cross-Modality Cross-Scale Fusion Transformer (CMFT) that enables multi-scale visible–infrared complementary information interaction, yielding a more comprehensive representation for person reID. Its fundamental component, the Cross-Modality Cross-Scale Fusion (CCF) module, captures cross-modal correlations and propagates the fused complementary and discriminative information across multiple scales. The proposed CMFT not only aligns the two modalities in a shared modality-invariant space but also captures temporal memory to ensure motion invariance. To mitigate the adverse effects of the modality gap, we propose a progressive learning scheme that first introduces a Modality-Shared Refinement Loss (MSRL) to guide the CMFT towards uncovering more reliable identity-related information from features shared across modalities, and then applies a Modality-Discriminative Loss (MDL) to tackle the challenges of large intra-class and small inter-class variation. Together, MSRL and MDL enhance the discriminative power of the reliable features. Importantly, the CMFT model is general and scalable, as evidenced by consistent performance improvements across different combinations of multimodal inputs. Experimental results confirm that CMFT effectively leverages the complementary semantic information in visible and infrared inputs and outperforms existing visible–infrared reID methods on the HITSZ-VCM, SYSU-MM01, and RegDB datasets.
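To give an intuition for the kind of cross-modal interaction the CCF module performs, the following is a minimal sketch of bidirectional cross-attention between visible and infrared token streams at a single scale. It is an illustrative assumption, not the authors' CCF implementation: the module name `CrossModalFusion`, the use of `nn.MultiheadAttention`, and the chosen dimensions are all hypothetical.

```python
# Minimal sketch: bidirectional cross-attention fusion of two modality streams.
# Hypothetical illustration only; not the paper's CCF module.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Exchange complementary information between visible and infrared tokens."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        # Visible tokens attend to infrared tokens, and vice versa.
        self.vis_to_ir = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ir_to_vis = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_vis = nn.LayerNorm(dim)
        self.norm_ir = nn.LayerNorm(dim)

    def forward(self, vis: torch.Tensor, ir: torch.Tensor):
        # vis, ir: (batch, tokens, dim) token sequences from one scale of each branch.
        fused_vis, _ = self.vis_to_ir(query=vis, key=ir, value=ir)
        fused_ir, _ = self.ir_to_vis(query=ir, key=vis, value=vis)
        # Residual connections keep modality-specific cues while injecting the
        # complementary information from the other modality.
        return self.norm_vis(vis + fused_vis), self.norm_ir(ir + fused_ir)


if __name__ == "__main__":
    fusion = CrossModalFusion(dim=256)
    vis_tokens = torch.randn(2, 128, 256)  # visible-branch tokens at one scale
    ir_tokens = torch.randn(2, 128, 256)   # infrared-branch tokens at one scale
    v, r = fusion(vis_tokens, ir_tokens)
    print(v.shape, r.shape)  # torch.Size([2, 128, 256]) torch.Size([2, 128, 256])
```

In a multi-scale design such as the one described above, a block of this kind would be applied at each scale, with the fused tokens propagated to the next stage of both branches.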