Semantic segmentation of remote sensing images plays an important role in many applications. However, a remote sensing image typically covers a complex, heterogeneous urban landscape containing objects of various sizes and materials, which makes the task challenging. In this work, a novel adaptive fusion network (AFNet) is proposed to improve the performance of very high resolution (VHR) remote sensing image segmentation. To coherently label size-varied ground objects from different categories, we design a multilevel architecture with a scale-feature attention module (SFAM). Through SFAM, low-level features from the shallow layers of the convolutional neural network (CNN) are enhanced at the locations of small objects, while high-level features from deep layers are enhanced at the locations of large objects. The features of size-varied objects are thus preserved when features from different levels are fused, which helps label objects of varying sizes. To label categories with high intra-class variance and varied scales, a multiscale structure with a scale-layer attention module (SLAM) is used to learn representative features, and an adjacent score map refinement module (ACSR) is employed as the classifier. Through SLAM, when multiscale features are fused, the feature map at the appropriate scale is given a greater weight according to the scale of the object of interest. With this scale-aware strategy, the learned features become more representative, which helps distinguish objects for semantic segmentation. In addition, performance is further improved by introducing several nonlinear layers into the ACSR. Extensive experiments on two well-known public high-resolution remote sensing image data sets demonstrate the effectiveness of the proposed model. Code and predictions are available at https://github.com/athauna/AFNet/
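
To make the SFAM idea concrete, the following is a minimal PyTorch-style sketch of attention-weighted fusion of a low-level (shallow, high-resolution) and a high-level (deep, low-resolution) feature map. The class name, layer shapes, and attention head are illustrative assumptions, not the authors' exact implementation; see the repository linked above for the actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleFeatureAttentionFusion(nn.Module):
    """Sketch of SFAM-style fusion: per-pixel attention decides how much
    low-level vs. high-level feature to keep, so small objects can favor
    shallow features and large objects deep ones. Hypothetical design."""

    def __init__(self, channels: int):
        super().__init__()
        # Small conv head predicting two per-pixel weights (one per level).
        self.attn = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 2, kernel_size=1),
        )

    def forward(self, low: torch.Tensor, high: torch.Tensor) -> torch.Tensor:
        # Upsample the deep feature map to the shallow map's resolution.
        high = F.interpolate(high, size=low.shape[-2:], mode="bilinear",
                             align_corners=False)
        # Softmax over the two levels: weights sum to 1 at every pixel.
        w = torch.softmax(self.attn(torch.cat([low, high], dim=1)), dim=1)
        return w[:, 0:1] * low + w[:, 1:2] * high

# Example usage with illustrative shapes:
fusion = ScaleFeatureAttentionFusion(channels=256)
low = torch.randn(1, 256, 128, 128)   # shallow, high-resolution feature
high = torch.randn(1, 256, 32, 32)    # deep, low-resolution feature
out = fusion(low, high)               # -> (1, 256, 128, 128)
```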
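
Similarly, the SLAM mechanism can be pictured as a per-pixel softmax over feature maps computed at several input scales, so that each location draws mostly on the scale that best matches the object of interest. Again, this is only a sketch under assumed shapes and naming; the paper's module may differ in detail.

```python
from typing import List

class ScaleLayerAttentionFusion(nn.Module):
    """Sketch of SLAM-style fusion: per-pixel attention weights over
    feature maps from several input scales. Hypothetical design."""

    def __init__(self, channels: int, num_scales: int):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Conv2d(num_scales * channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, num_scales, kernel_size=1),
        )

    def forward(self, feats: List[torch.Tensor]) -> torch.Tensor:
        # Resize every scale's feature map to a common resolution.
        size = feats[0].shape[-2:]
        feats = [F.interpolate(f, size=size, mode="bilinear",
                               align_corners=False) for f in feats]
        # Softmax over scales, then weight each scale's map and sum.
        w = torch.softmax(self.attn(torch.cat(feats, dim=1)), dim=1)
        return sum(w[:, i:i + 1] * f for i, f in enumerate(feats))
```

In both sketches the softmax normalization is what makes the fusion scale-aware: at each pixel the weights form a convex combination, so emphasizing one level or scale necessarily de-emphasizes the others.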