ABSTRACT Automated building extraction is essential for several geospatial applications, such as monitoring disaster-affected buildings and urban planning. Existing deep learning (DL)-based building extraction methods often fail to capture high-level semantic features because of the complexity and diverse appearance of visually similar structures. To address this issue, in this letter we propose an enhanced multi-scale attentive feature fusion network (EMAFF-Net) for building extraction from remote sensing (RS) images. EMAFF-Net is an end-to-end DL architecture based on U-Net that comprises: i) an encoder; ii) an enhanced multi-scale feature fusion (EMFF) module; iii) a refined multi-scale convolutional block attention (RM-CBAM) module; and iv) a decoder with refinement layers. To extract multi-scale contextual information, we incorporate the RM-CBAM module into the lateral connections between the encoder and decoder layers of EMAFF-Net. Further, a novel EMFF module is integrated to obtain fine-grained features from the lowest encoder layer while requiring few trainable parameters. We evaluate the proposed network on two benchmark datasets: the Massachusetts (MAS) and WHU building datasets. The experimental results show that the proposed approach outperforms the existing reference methods, demonstrating its potential for practical applications.