Semantic segmentation of remote sensing images is a fundamental task in computer vision, with significant applications in forest and farmland cover surveys, geological disaster monitoring, and other related fields. Incorporating digital surface models can improve segmentation performance compared to using unimodal imagery alone. However, most existing methods simply combine the features of the two modalities without accounting for their differences and complementarity, which leads to a loss of spatial detail. To address this issue and improve segmentation accuracy, we propose a novel network called the Progressive Adjacent-Layer Coordination Symmetric Cascade Network (PACSCNet). The network employs a two-stage fusion symmetric cascade encoder that exploits the similarities and differences between adjacent features for cross-layer fusion, thereby preserving spatial detail. In addition, a new dual-pyramid symmetric cascade decoder extracts the similarities among multimodal and cross-layer fusion features for merging, and a pyramid residual integration module progressively integrates features at four scales to mitigate noise interference during large-scale fusion. Extensive experiments on the Vaihingen and Potsdam datasets demonstrate that PACSCNet achieves semantic segmentation performance comparable to state-of-the-art approaches in terms of accuracy and intersection-over-union. The source code and results of PACSCNet are publicly available at https://github.com/F8AoMn/PACSCNet.
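
To make the progressive four-scale integration idea concrete, the following is a minimal PyTorch sketch of coarse-to-fine fusion with residual refinement at each step. It is an illustrative assumption about the general mechanism only: the class name `ProgressiveFourScaleFusion`, the channel widths, and the layer layout are hypothetical and do not reproduce the authors' exact pyramid residual integration module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProgressiveFourScaleFusion(nn.Module):
    """Sketch: progressively merge four feature scales, coarse to fine,
    with a residual refinement branch at each step (illustrative only)."""

    def __init__(self, channels=(64, 128, 256, 512), out_channels=64):
        super().__init__()
        # 1x1 convolutions project every scale to a common channel width
        self.reduce = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in channels
        )
        # 3x3 blocks refine each intermediate fusion result (residual branch)
        self.refine = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(out_channels, out_channels, 3, padding=1),
                nn.BatchNorm2d(out_channels),
                nn.ReLU(inplace=True),
            )
            for _ in channels[:-1]
        )

    def forward(self, feats):
        # feats: [f1, f2, f3, f4] from fine (high resolution) to coarse (low resolution)
        feats = [proj(f) for proj, f in zip(self.reduce, feats)]
        fused = feats[-1]                        # start from the coarsest scale
        for i in range(len(feats) - 2, -1, -1):  # walk back toward full resolution
            fused = F.interpolate(fused, size=feats[i].shape[-2:],
                                  mode="bilinear", align_corners=False)
            # residual merge: refined sum added back onto the finer-scale feature
            fused = feats[i] + self.refine[i](fused + feats[i])
        return fused


if __name__ == "__main__":
    # Four-scale features, e.g. strides 4/8/16/32 of a 256x256 input
    feats = [torch.randn(1, c, s, s)
             for c, s in zip((64, 128, 256, 512), (64, 32, 16, 8))]
    out = ProgressiveFourScaleFusion()(feats)
    print(out.shape)  # torch.Size([1, 64, 64, 64])
```

In this sketch, integrating one adjacent scale at a time (rather than concatenating all four scales at once) is what limits noise propagation during large-scale fusion; the residual branch lets each step correct the upsampled coarse features against the finer-scale ones.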