In cross-modal person re-identification, researchers often employ methods that use visible-modality information to generate both an ‘X’ modality and a grayscale modality, enhancing the accuracy of person re-identification models. The ‘X’ modality is produced by a lightweight network trained in a self-supervised manner on visible images, while the grayscale modality is obtained through a simple linear combination of the three RGB channels of visible images. It can be observed that both the ‘X’ modality and the grayscale modality are derived solely from visible images, so neither establishes a connection between the visible and infrared modalities. This paper therefore proposes an intermediate modality generation module that dynamically produces intermediate-modality representations. By combining information from the visible, infrared, and intermediate modalities, the model is encouraged to capture modality-invariant features with cross-modal consistency. This enables persons of the same identity to exhibit similar feature representations across modalities, thereby mitigating the impact of the distribution differences between the visible and infrared modalities. Additionally, to facilitate the learning of appropriate intermediate-modality representations, a distribution migration strategy is introduced. It guides the intermediate modality to maintain an appropriate distance from both the visible and infrared modalities by optimizing the weights of the loss functions, preventing inadequate feature learning caused by an excessive focus on either modality. Furthermore, a mixed augmentation approach is proposed to further alleviate the disparities among the modalities. By randomly cropping regions of visible (infrared) images and combining them with infrared (visible) images, the generalization ability of the model across heterogeneous modalities is enhanced. Extensive comparative experiments are conducted on the SYSU-MM01 and RegDB datasets, yielding mAP values of 57.2% and 85.82%, respectively. The superior mAP performance on the RegDB dataset compared with most existing methods validates the effectiveness of the proposed approach.
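To make the intermediate modality generation module concrete, the following is a minimal sketch, assuming the generator is a lightweight pair of 1x1 convolutional encoders whose outputs are blended by a learnable weight. The class name `IntermediateModalityGenerator`, the hidden width, and the single mixing parameter are illustrative assumptions, not the paper's exact architecture.

```python
# Hypothetical sketch of an intermediate-modality generator; names and layer
# sizes are assumptions, not the paper's specification.
import torch
import torch.nn as nn


class IntermediateModalityGenerator(nn.Module):
    """Maps a visible/infrared image pair to a shared intermediate modality."""

    def __init__(self, channels: int = 3, hidden: int = 16):
        super().__init__()
        # One lightweight encoder per modality, 1x1 convolutions only.
        self.vis_enc = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1))
        self.ir_enc = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1))
        # Learnable mixing weight: decides how close the generated modality
        # lies to each source modality and is trained jointly with the backbone.
        self.mix_weight = nn.Parameter(torch.tensor(0.5))

    def forward(self, visible: torch.Tensor, infrared: torch.Tensor) -> torch.Tensor:
        w = torch.sigmoid(self.mix_weight)  # keep the weight in (0, 1)
        return w * self.vis_enc(visible) + (1.0 - w) * self.ir_enc(infrared)


# Usage: generate intermediate-modality images for a paired mini-batch.
gen = IntermediateModalityGenerator()
vis_batch = torch.rand(4, 3, 288, 144)   # visible images
ir_batch = torch.rand(4, 3, 288, 144)    # infrared images (replicated to 3 channels)
intermediate = gen(vis_batch, ir_batch)  # same spatial shape as the inputs
```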
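The distribution migration strategy is described only at a high level above; one plausible reading is that the loss weights are derived from how far the intermediate-modality features have drifted toward either source modality. The sketch below follows that assumption; the function name `migration_weights` and the use of batch-wise modality centres are hypothetical.

```python
# Hypothetical sketch of distance-driven loss weighting for the distribution
# migration strategy; the exact weighting rule in the paper may differ.
import torch
import torch.nn.functional as F


def migration_weights(feat_vis: torch.Tensor, feat_ir: torch.Tensor,
                      feat_mid: torch.Tensor):
    """Return loss weights that grow for whichever modality the intermediate
    representation has drifted away from, keeping it between both."""
    # Modality centres over the mini-batch; features have shape (batch, dim).
    c_vis, c_ir, c_mid = feat_vis.mean(0), feat_ir.mean(0), feat_mid.mean(0)
    d_vis = torch.norm(c_mid - c_vis)  # distance to the visible centre
    d_ir = torch.norm(c_mid - c_ir)    # distance to the infrared centre
    w = torch.softmax(torch.stack([d_vis, d_ir]), dim=0)
    return w[0], w[1]


# Usage: weight two alignment losses so the intermediate modality stays balanced.
feat_vis, feat_ir, feat_mid = torch.rand(8, 512), torch.rand(8, 512), torch.rand(8, 512)
w_vis, w_ir = migration_weights(feat_vis, feat_ir, feat_mid)
loss = w_vis * F.mse_loss(feat_mid, feat_vis) + w_ir * F.mse_loss(feat_mid, feat_ir)
```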
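For the mixed augmentation, a minimal sketch is given below, assuming a CutMix-style crop-and-paste between paired visible and infrared images. The function name `cross_modal_mix` and the region-size range are illustrative choices, not values taken from the paper.

```python
# Hypothetical sketch of cross-modal mixed augmentation; region sizes and the
# function name are assumptions for illustration.
import random
import torch


def cross_modal_mix(src: torch.Tensor, dst: torch.Tensor,
                    min_ratio: float = 0.2, max_ratio: float = 0.5) -> torch.Tensor:
    """Paste a random rectangular region of `src` (e.g. a visible image) onto
    `dst` (e.g. the paired infrared image), or vice versa."""
    _, h, w = dst.shape
    rh = int(h * random.uniform(min_ratio, max_ratio))  # region height
    rw = int(w * random.uniform(min_ratio, max_ratio))  # region width
    top = random.randint(0, h - rh)
    left = random.randint(0, w - rw)
    mixed = dst.clone()
    mixed[:, top:top + rh, left:left + rw] = src[:, top:top + rh, left:left + rw]
    return mixed


# Usage: augment a paired visible/infrared sample in both directions.
visible = torch.rand(3, 288, 144)
infrared = torch.rand(3, 288, 144)
ir_with_vis_patch = cross_modal_mix(visible, infrared)  # infrared image carrying a visible region
vis_with_ir_patch = cross_modal_mix(infrared, visible)  # visible image carrying an infrared region
```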