This paper investigates the problem of learning an effective and robust fusion representation for foreground moving object detection. Many fusion representation learning approaches focus on measuring the similarity between the fused results and the source images in terms of texture detail and pixel intensity, while ignoring harmful information such as noise, blur, and extreme illumination. Consequently, the aggregated features of infrared and visible images introduce considerable harmful information, degrading the performance of downstream visual tasks. This paper tackles these problems by proposing a contrastive fusion representation learning method for foreground moving object detection, which consists of two major modules: the upstream fusion representation module (FRM) and the downstream foreground moving object detection module (FODM). Unlike the traditional fusion optimization mechanism, the former aims to extract valuable features and reject harmful features by maximizing mutual information. The latter is a siamese convolutional neural network that detects foreground moving objects by aggregating the time-sequence images generated by the FRM. Experimental results and comparisons with the state-of-the-art on three public datasets (i.e., TNO, MF, and the cross-modal FOD dataset of infrared and visible images) validate the effectiveness, robustness, and overall superiority of the proposed contrastive fusion representation learning method. Specifically, our contrastive fusion representation learning achieves gains of 53.9%, 43.2%, 46.4%, 52.3%, 2.2%, 87.1%, 3.5% on the EI, SF, DF, AG, MI, and Nabf metrics compared with the best competitors.
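The abstract states that the FRM extracts valuable features and rejects harmful ones by maximizing mutual information, but does not spell out the objective here. As an illustration only, a common proxy for such a mutual-information-maximization objective is an InfoNCE-style contrastive loss between fused-image embeddings and source-image embeddings; the sketch below uses hypothetical tensors `fused_feat` and `source_feat` and does not reproduce the authors' actual FRM loss.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(fused_feat, source_feat, temperature=0.07):
    """InfoNCE-style contrastive loss: a standard lower-bound proxy for
    maximizing mutual information between fused features and source-image
    features. Matching rows are positive pairs; all other rows in the
    batch serve as negatives."""
    # L2-normalize so dot products become cosine similarities
    fused = F.normalize(fused_feat, dim=1)
    source = F.normalize(source_feat, dim=1)

    # Pairwise similarity matrix, scaled by the temperature
    logits = fused @ source.t() / temperature  # shape: (batch, batch)

    # The i-th fused embedding should match the i-th source embedding
    targets = torch.arange(fused.size(0), device=fused.device)
    return F.cross_entropy(logits, targets)

if __name__ == "__main__":
    # Placeholder embeddings standing in for FRM outputs and source features
    f = torch.randn(8, 128)
    s = torch.randn(8, 128)
    print(info_nce_loss(f, s).item())
```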