Winter wheat is a significant global food crop, and it is crucial to monitor its distribution for better agricultural management, land planning, and environmental sustainability. However, the distribution style of winter wheat planting fields is not consistent due to different terrain conditions. In mountainous areas, winter wheat planting units are smaller in size and fragmented in distribution compared to plain areas. Unfortunately, most crop-mapping research based on deep learning ignores the impact of topographic relief on crop distribution and struggles to handle hilly areas effectively. In this paper, we propose a cross-modal segmentation network for winter wheat mapping in complex terrain using remote-sensing multi-temporal images and DEM data. First, we propose a diverse receptive fusion (DRF) module, which applies a deformable receptive field to optical images during the feature fusion process, allowing it to match winter wheat plots of varying scales and a fixed receptive field to the DEM to extract evaluation features at a consistent scale. Second, we developed a distributed weight attention (DWA) module, which can enhance the feature intensity of winter wheat, thereby reducing the omission rate of planting areas, especially for the small-sized regions in hilly terrain. Furthermore, to demonstrate the performance of our model, we conducted extensive experiments and ablation studies on a large-scale dataset in Lanling county, Shandong province, China. Our results show that our proposed CM-Net is effective in mapping winter wheat in complex terrain.