High-resolution remote sensing images often contain large weakly textured regions, such as the roofs of large buildings, that span many pixels. These regions make it difficult for traditional semantic segmentation networks to obtain satisfactory results, and common strategies such as downsampling, patch cropping, and cascade models sacrifice either fine detail or global context, limiting accuracy. To address these issues, this paper designs a novel semantic segmentation framework for large-format high-resolution remote sensing images that aggregates global and local features. The framework consists of two branches: one processes a low-resolution downsampled image to capture global context, while the other focuses on cropped patches to extract high-resolution local detail. This paper also introduces a Transformer-based feature aggregation module that effectively fuses global and local information. Additionally, a novel three-step training method is developed to reduce GPU memory consumption. Extensive experiments on two public datasets demonstrate the effectiveness of the proposed approach, which achieves an IoU of 90.83% on the AIDS dataset and 90.30% on the WBDS dataset, surpassing state-of-the-art methods such as DANet, DeepLab v3+, U-Net, ViT, TransUNet, CMTFNet, and UANet.
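As a minimal sketch of the two-branch idea, the PyTorch code below pairs a global encoder applied to a downsampled full image with a local encoder applied to a high-resolution crop, and fuses their token sequences with cross-attention, one plausible realization of a Transformer-based aggregation module. All module names, encoder choices, dimensions, and the specific use of cross-attention are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class CrossAttentionAggregator(nn.Module):
    """Hypothetical Transformer-style fusion of global and local features.

    Local tokens attend to global tokens so that each patch location can
    draw on image-wide context; the paper's module may differ in structure.
    """
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, local_tok, global_tok):
        # local_tok: (B, Nl, C), global_tok: (B, Ng, C)
        fused, _ = self.attn(query=local_tok, key=global_tok, value=global_tok)
        return self.norm(local_tok + fused)  # residual connection

class TwoBranchSegmenter(nn.Module):
    """Global branch sees the downsampled image; local branch sees a crop.

    The strided convolutions stand in for full backbones (any CNN or
    Transformer encoder could be substituted).
    """
    def __init__(self, dim=256, num_classes=2):
        super().__init__()
        self.global_enc = nn.Conv2d(3, dim, kernel_size=8, stride=8)
        self.local_enc = nn.Conv2d(3, dim, kernel_size=8, stride=8)
        self.agg = CrossAttentionAggregator(dim)
        self.head = nn.Conv2d(dim, num_classes, kernel_size=1)

    def forward(self, global_img, local_patch):
        g = self.global_enc(global_img)    # (B, C, Hg, Wg) global context
        l = self.local_enc(local_patch)    # (B, C, Hl, Wl) local detail
        B, C, Hl, Wl = l.shape
        g_tok = g.flatten(2).transpose(1, 2)  # (B, Ng, C)
        l_tok = l.flatten(2).transpose(1, 2)  # (B, Nl, C)
        fused = self.agg(l_tok, g_tok)        # (B, Nl, C)
        fused = fused.transpose(1, 2).reshape(B, C, Hl, Wl)
        return self.head(fused)               # per-pixel logits for the patch

# Usage: e.g. a large image downsampled to 512x512 for the global branch,
# segmented patch by patch with 512x512 crops in the local branch.
model = TwoBranchSegmenter()
g_img = torch.randn(1, 3, 512, 512)
patch = torch.randn(1, 3, 512, 512)
logits = model(g_img, patch)  # (1, 2, 64, 64)
```

In this sketch only the local branch's tokens are decoded, which mirrors how large images can be processed crop by crop while each crop still attends to global context; training the branches and the aggregator in separate steps (as in the paper's three-step scheme) would bound peak GPU memory, though the exact staging is not specified here.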