Infrared and visible image fusion (IVIF) aims to integrate the features of the source images into a fused image that highlights foreground objects while preserving background details. However, because they ignore the mutual information interaction between source images, prevailing fusion methods often suffer from incomplete or redundant feature extraction. To address this problem, this study presents a self-supervised fusion framework for IVIF based on multi-level contrastive auto-encoding, referred to as SS-MCAE. SS-MCAE consists of a coarse-to-fine feature extraction network (CFEN) and a fusion network (FN), where CFEN contains a content-aware enhancement module (CEM) and a multi-level contrastive learning autoencoder (MCLA). Specifically, CFEN employs the CEM to adaptively enhance brightness information and texture details from the source images, and the MCLA then refines and extracts these features hierarchically. Within the contrastive learning, a novel sample generator (SG) constructs positive and negative samples, with the aim of maximizing feature extraction while reducing redundant information. Furthermore, a new perceptual loss is developed to retain the original features and guide image reconstruction, establishing a more reliable relationship between the extracted features and the source images. Comprehensive experiments show that the proposed SS-MCAE outperforms current approaches in both visual quality and quantitative evaluation.