Cloud obscuration undermines the availability of optical images for continuous monitoring in earth observation. Fusing features from synthetic aperture radar (SAR) has been recognized as a feasible strategy to guide the reconstruction of corrupted signals in cloud-contaminated regions. However, owing to the different imaging mechanisms and reflection characteristics, the substantial domain gap between SAR and optical images makes effective cross-modality feature fusion a challenging problem. Although several SAR-assisted cloud removal methods have been proposed, most of them fail to achieve adequate information interaction between modalities, which greatly limits the effectiveness and reliability of cross-modality fusion. In this paper, we propose a novel hierarchical framework for cloud-free multispectral image reconstruction that effectively integrates SAR and optical data through a dual-domain interactive attention mechanism. The overall encoder–decoder network is a W-shaped asymmetric structure with a two-branch encoder and a single decoder. The encoder branches extract features from SAR and optical images separately, while the decoder exploits multiscale residual block groups to expand the receptive field and a multi-output strategy to reduce the training difficulty. The core cross-modality feature fusion module at the bottleneck adopts a dual-domain interactive attention (DDIA) mechanism, which strengthens the mutual infusion of SAR and optical features to promote the reconstruction of spectral and structural information. Furthermore, features in the spatial and frequency domains are integrated to improve the effectiveness of the fusion process. Consistent with the overall network structure, the loss function is designed as a multiscale loss in both domains. The proposed method realizes sufficient information exchange and effective cross-modality fusion between SAR and optical features. Extensive experiments on the SMILE-CR and SEN12MS-CR datasets demonstrate that the proposed method outperforms seven representative deep-learning methods in terms of both visual quality and quantitative accuracy.
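
To make the bottleneck fusion idea concrete, the following is a minimal PyTorch sketch of a dual-domain interactive fusion block in the spirit of the described DDIA mechanism: bidirectional cross-attention between SAR and optical tokens in the spatial domain, plus a simple mixing of their Fourier spectra in the frequency domain. The abstract does not specify the actual module design, so the class name `DualDomainInteractiveFusion`, the head count, the FFT-based mixing, and all channel sizes are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only (not the authors' code): dual-domain interactive
# fusion of SAR and optical feature maps at the bottleneck of an
# encoder-decoder network. All design details here are assumptions.
import torch
import torch.nn as nn


class DualDomainInteractiveFusion(nn.Module):
    """Fuse SAR and optical features in the spatial and frequency domains."""

    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        # Spatial-domain branch: bidirectional cross-attention over tokens.
        self.opt_from_sar = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.sar_from_opt = nn.MultiheadAttention(channels, heads, batch_first=True)
        # Frequency-domain branch: 1x1 convolution on the concatenated
        # real/imaginary parts of both modalities' Fourier spectra.
        self.freq_mix = nn.Conv2d(4 * channels, 2 * channels, kernel_size=1)
        # Final projection back to a single fused feature map.
        self.out_proj = nn.Conv2d(3 * channels, channels, kernel_size=1)

    def forward(self, f_opt: torch.Tensor, f_sar: torch.Tensor) -> torch.Tensor:
        b, c, h, w = f_opt.shape

        # ---- Spatial domain: mutual (interactive) cross-attention ----
        opt_tokens = f_opt.flatten(2).transpose(1, 2)  # (B, HW, C)
        sar_tokens = f_sar.flatten(2).transpose(1, 2)
        opt_enh, _ = self.opt_from_sar(opt_tokens, sar_tokens, sar_tokens)
        sar_enh, _ = self.sar_from_opt(sar_tokens, opt_tokens, opt_tokens)
        spatial = (opt_enh + sar_enh).transpose(1, 2).reshape(b, c, h, w)

        # ---- Frequency domain: interaction on Fourier spectra ----
        opt_fft = torch.fft.rfft2(f_opt, norm="ortho")
        sar_fft = torch.fft.rfft2(f_sar, norm="ortho")
        spec = torch.cat(
            [opt_fft.real, opt_fft.imag, sar_fft.real, sar_fft.imag], dim=1
        )
        real, imag = self.freq_mix(spec).chunk(2, dim=1)
        freq = torch.fft.irfft2(torch.complex(real, imag), s=(h, w), norm="ortho")

        # ---- Combine both domains and project to the fused output ----
        return self.out_proj(torch.cat([f_opt + spatial, freq, f_sar], dim=1))


if __name__ == "__main__":
    fuse = DualDomainInteractiveFusion(channels=64)
    opt_feat = torch.randn(2, 64, 32, 32)   # optical bottleneck features
    sar_feat = torch.randn(2, 64, 32, 32)   # SAR bottleneck features
    print(fuse(opt_feat, sar_feat).shape)   # torch.Size([2, 64, 32, 32])
```

In this sketch, the spatial branch lets each modality query the other so information flows in both directions, while the frequency branch mixes global spectral structure that local attention windows might miss; the two are then projected jointly, mirroring the abstract's idea of integrating spatial- and frequency-domain features before decoding.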