Semantic segmentation of remote sensing images aims to assign a semantic category to every pixel of an input image. This task has advanced significantly with the rapid development of deep neural networks. Most existing methods focus on effectively fusing low-level spatial details with high-level semantic cues, while others incorporate boundary guidance to obtain boundary-preserving segmentation. However, current methods treat multi-level feature fusion and boundary guidance as two separate tasks, which leads to sub-optimal solutions. Moreover, due to the large intra-class variance and small inter-class difference within remote sensing images, current methods often fail to accurately aggregate long-range contextual cues. These issues prevent current methods from producing satisfactory segmentation results, which in turn hinders downstream applications. To this end, we first propose a novel boundary guided multi-level feature fusion module that seamlessly incorporates boundary guidance into the multi-level feature fusion operations. To further enforce this guidance, we employ a geometric-similarity-based boundary loss function, so that the multi-level features are combined under an explicit boundary constraint. In addition, we present a channel-wise correlation guided spatial-semantic context aggregation module that captures subtle but meaningful cues from both pixel-wise spatial context and channel-wise semantic correlation. Extensive qualitative and quantitative experiments on the ISPRS Vaihingen and GaoFen-2 datasets demonstrate the effectiveness of the proposed method.
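The boundary guided multi-level feature fusion module is described above only at the architectural level. As a concrete illustration, the following is a minimal PyTorch sketch of one plausible realization, assuming the boundary guidance takes the form of a single-channel boundary probability map predicted from the low-level features and used to re-weight them before fusion with the upsampled high-level features; the class name BoundaryGuidedFusion and all hyperparameters are hypothetical, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BoundaryGuidedFusion(nn.Module):
    """Hypothetical sketch: fuse low-level spatial and high-level semantic
    features under the guidance of a predicted boundary map."""
    def __init__(self, low_ch, high_ch, out_ch):
        super().__init__()
        self.boundary_head = nn.Conv2d(low_ch, 1, kernel_size=3, padding=1)
        self.fuse = nn.Sequential(
            nn.Conv2d(low_ch + high_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, low, high):
        # Upsample high-level semantics to the low-level spatial resolution.
        high = F.interpolate(high, size=low.shape[2:], mode="bilinear",
                             align_corners=False)
        # Predict a boundary probability map from the low-level features.
        boundary = torch.sigmoid(self.boundary_head(low))
        # Emphasize low-level spatial details near object boundaries
        # before fusing them with the high-level semantic features.
        low = low * (1.0 + boundary)
        fused = self.fuse(torch.cat([low, high], dim=1))
        return fused, boundary
```

Returning the boundary map alongside the fused features lets the boundary loss supervise the same signal that gates the fusion, which is one way to couple the two tasks rather than treating them separately.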
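The abstract does not specify which geometric similarity the boundary loss measures; a Dice-style overlap between the predicted and ground-truth boundary maps is one common choice for such losses. The sketch below assumes that reading, and the function name dice_boundary_loss is hypothetical.

```python
import torch

def dice_boundary_loss(pred_boundary, gt_boundary, eps=1e-6):
    """Hypothetical geometric-similarity boundary loss: one minus the Dice
    overlap between predicted and ground-truth boundary maps.

    pred_boundary, gt_boundary: (N, 1, H, W) tensors with values in [0, 1].
    """
    p = pred_boundary.flatten(1)
    g = gt_boundary.flatten(1)
    inter = (p * g).sum(dim=1)
    union = p.sum(dim=1) + g.sum(dim=1)
    dice = (2.0 * inter + eps) / (union + eps)
    return (1.0 - dice).mean()
```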
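Likewise, the channel-wise correlation guided spatial-semantic context aggregation module can be pictured as combining a non-local spatial attention term, which gathers pixel-wise spatial context, with a channel Gram-matrix term, which captures channel-wise semantic correlation. The sketch below is an assumption-laden illustration of that idea, not the paper's actual module.

```python
import torch
import torch.nn as nn

class SpatialSemanticContext(nn.Module):
    """Hypothetical sketch: aggregate pixel-wise spatial context and
    channel-wise semantic correlation into one feature map."""
    def __init__(self, channels, reduced=None):
        super().__init__()
        reduced = reduced or max(channels // 8, 1)
        self.query = nn.Conv2d(channels, reduced, 1)
        self.key = nn.Conv2d(channels, reduced, 1)
        self.value = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learned residual scale
        self.beta = nn.Parameter(torch.zeros(1))   # learned residual scale

    def forward(self, x):
        n, c, h, w = x.shape
        # Pixel-wise spatial context via non-local style attention.
        q = self.query(x).flatten(2).transpose(1, 2)   # (N, HW, C')
        k = self.key(x).flatten(2)                     # (N, C', HW)
        attn = torch.softmax(q @ k, dim=-1)            # (N, HW, HW)
        v = self.value(x).flatten(2).transpose(1, 2)   # (N, HW, C)
        spatial = (attn @ v).transpose(1, 2).reshape(n, c, h, w)
        # Channel-wise semantic correlation via a Gram matrix over channels.
        feat = x.flatten(2)                                        # (N, C, HW)
        gram = torch.softmax(feat @ feat.transpose(1, 2), dim=-1)  # (N, C, C)
        channel = (gram @ feat).reshape(n, c, h, w)
        # Residual aggregation of both context terms.
        return x + self.gamma * spatial + self.beta * channel
```

Adding both context terms residually with learned scales initialized at zero lets the module start as an identity mapping and gradually learn how much spatial and channel context to inject.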