This study addresses the problem of moving object detection in infrared and visible images. Existing approaches primarily focus on single-task detection using single-spectral image data, such as thermal infrared or visible images, and often ignore both the differences between spectral modalities and the interconnectedness of related tasks such as image fusion and segmentation. To tackle these problems, we present a novel multi-spectral image fusion network with quality and semantic awareness for moving object detection (MOD), particularly in scenarios where ground-truth labels for both infrared and visible images are unavailable. Our approach fuses multi-spectral images and incorporates auxiliary subtasks into the fusion process to obtain content and quality perception of the infrared and visible inputs. In addition, we design a novel residual global perception module (RGPM) and a multi-spectral fusion loss, which capture more hidden features and contextual information across various scales. This enhanced capability leads to more precise detection and tracking of moving objects, particularly in challenging situations involving occlusions and dynamic backgrounds. Compared with single-spectral moving object detection, the hurdles of applying deep learning to multi-spectral image fusion, e.g., the absence of ground-truth labels and the presence of harmful noise, are significantly mitigated. Extensive quantitative and qualitative experiments demonstrate the effectiveness, robustness, and superior performance of our method compared with contemporary approaches. In particular, the proposed fusion representation learning achieves gains of 44.2%, 31.2%, 36.3%, 41.6%, 5.3%, 31.2%, and 76.2% on the EI, SF, DF, AG, MI, SD, and Nabf metrics, respectively, over the best competitors.
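To make the RGPM idea concrete, below is a minimal PyTorch sketch of a residual module that gathers multi-scale context and reweights it by a global descriptor. The abstract only states that the RGPM captures hidden features and contextual information across scales; the specific choices here (parallel dilated convolutions, global-pooling channel attention, and a residual connection) are our assumptions for illustration, not the authors' exact design.

```python
# Hypothetical sketch of a residual global perception module (RGPM).
# Layer choices are assumptions inferred from the abstract, not the
# paper's actual architecture.
import torch
import torch.nn as nn

class RGPM(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Parallel dilated convolutions gather context at several scales.
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d)
            for d in (1, 2, 4)
        ])
        self.merge = nn.Conv2d(3 * channels, channels, 1)
        # "Global perception": channel attention from a pooled global descriptor.
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        multi_scale = torch.cat([b(x) for b in self.branches], dim=1)
        y = self.merge(multi_scale)
        y = y * self.attn(y)   # reweight features by global context
        return x + y           # residual connection preserves the input signal

# Usage: apply to a feature map from the fusion backbone.
feats = torch.randn(1, 64, 128, 128)
out = RGPM(64)(feats)
print(out.shape)  # torch.Size([1, 64, 128, 128])
```

The residual connection keeps the original single-spectral features intact while the attention-weighted multi-scale branch adds the cross-scale context the abstract attributes to the RGPM.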