Existing multi-resolution infrared and visible image fusion methods preserve texture detail poorly, which restricts their practical application. In this paper, we propose CMRFusion, a cross-domain multi-resolution infrared and visible image fusion method based on auto-encoder networks and a cross-domain attention fusion strategy. The auto-encoder networks extract deep multi-scale features with encoder networks and reconstruct images with decoder networks. The cross-domain attention fusion strategy promotes the preservation of texture detail from one of the source images. In the proposed method, low-resolution infrared images are first up-scaled by simple bicubic interpolation to match the resolution of the visible images. Then, an encoder network extracts features from the up-scaled infrared and visible images. The infrared features serve as the base and are supplemented with details from the visible features through the cross-domain attention fusion strategy; the resulting fused features are used to reconstruct a high-resolution infrared image with the first decoder network. Finally, the encoder network extracts features from the visible and reconstructed infrared images. The visible features serve as the base and are supplemented with details from the features of the reconstructed high-resolution infrared image through the cross-domain attention fusion strategy; the resulting fused features are used to reconstruct the fusion result with the second decoder network. Qualitative and quantitative experiments on the TNO, OSU, and MSRS datasets indicate that CMRFusion balances the information from the source images and retains texture detail from the visible image well.
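The two-stage data flow described above can be illustrated with a minimal NumPy sketch. It is not the CMRFusion implementation. Every function here is a hypothetical stand-in: `upscale_nearest` replaces bicubic interpolation with pixel repetition, `encode` replaces the learned multi-scale encoder with simple gradient features, `cross_domain_fuse` replaces the learned cross-domain attention with a sigmoid weight on feature activity, and a single `decode` stands in for the two separate decoder networks. Only the order of operations follows the text.

```python
import numpy as np

def upscale_nearest(img, factor):
    # Stand-in for bicubic up-scaling (assumption): repeat each pixel factor x factor times.
    return np.kron(img, np.ones((factor, factor)))

def encode(img):
    # Hypothetical encoder: intensity plus horizontal and vertical gradients, shape (3, H, W).
    gy, gx = np.gradient(img)
    return np.stack([img, gx, gy], axis=0)

def cross_domain_fuse(base, detail):
    # Hypothetical attention: per-pixel weight in (0, 1) that grows where the
    # detail features are more active than the base features.
    act_base = np.abs(base).mean(axis=0, keepdims=True)
    act_detail = np.abs(detail).mean(axis=0, keepdims=True)
    attn = 1.0 / (1.0 + np.exp(-(act_detail - act_base)))
    # Keep the base features and move them toward the detail features by attn.
    return base + attn * (detail - base)

def decode(features):
    # Hypothetical decoder: read the image back from the intensity channel.
    # The method in the text uses two separate learned decoders instead.
    return features[0]

def cmr_fusion_sketch(ir_lowres, visible):
    # Step 1: bring the infrared image to the visible resolution.
    factor = visible.shape[0] // ir_lowres.shape[0]
    ir_up = upscale_nearest(ir_lowres, factor)
    # Step 2: infrared features are the base, visible features supply detail.
    ir_highres = decode(cross_domain_fuse(encode(ir_up), encode(visible)))
    # Step 3: visible features are the base, reconstructed infrared supplies detail.
    fused = decode(cross_domain_fuse(encode(visible), encode(ir_highres)))
    return ir_highres, fused

rng = np.random.default_rng(0)
ir_lowres = rng.random((4, 4))   # toy low-resolution infrared image in [0, 1)
visible = rng.random((8, 8))     # toy visible image in [0, 1)
ir_highres, fused = cmr_fusion_sketch(ir_lowres, visible)
print(ir_highres.shape, fused.shape)
```

The sketch assumes the visible resolution is an integer multiple of the infrared resolution and that both images are square and single-channel. The point it makes is structural: the same encoder is reused in both stages, and the roles of base and detail swap between the first and second fusion.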