ABSTRACT
Remote Sensing Image Change Captioning (RSICC) has recently emerged in the field of remote sensing image interpretation; it aims to automatically generate natural language captions of significant semantic changes between bi-temporal remote sensing images. Recent RSICC studies have considerably improved the accuracy of change captions for bi-temporal remote sensing images. Nevertheless, challenges remain in the multi-scale perception of ground objects and in the feature enhancement of bi-temporal remote sensing images. To address these challenges and further improve RSICC accuracy, this paper proposes a novel deep learning-based end-to-end scale-wise feature enhancement network (SFEN). SFEN integrates four efficient blocks: 1) a Siamese backbone network (SBN) that extracts initial features from bi-temporal remote sensing images; 2) a Siamese receptive field fusion (SRFF) block that explicitly captures multi-scale semantic information of ground objects in the bi-temporal feature maps; 3) a Siamese global feature enhancement (SGFE) block that adaptively enhances key information and filters redundant features of the bi-temporal feature maps in both the channel and spatial dimensions; and 4) a change caption decoder (CCD) that maps the bi-temporal feature maps into natural language. SFEN is designed to precisely capture the significant semantic information of ground objects in bi-temporal remote sensing images and to generate accurate change captions. Experimental results on the LEVIR-CC dataset demonstrate that SFEN outperforms the most recent state-of-the-art (SOTA) approach for RSICC by 5.2% on CIDEr-D and establishes a new SOTA.
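
To make the four-block pipeline named above concrete, the following is a minimal, hypothetical PyTorch sketch of how SBN, SRFF, SGFE, and the CCD could be composed end to end. The internal designs shown here (a shared ResNet-50 backbone, parallel dilated convolutions, channel/spatial gating, and a Transformer decoder) are generic stand-ins chosen only for illustration; they are assumptions, not the blocks actually proposed in the paper.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class SFENSketch(nn.Module):
    """Illustrative SBN -> SRFF -> SGFE -> CCD composition (placeholder internals)."""

    def __init__(self, vocab_size=5000, d_model=512):
        super().__init__()
        # 1) Siamese backbone network (SBN): one shared CNN applied to both images.
        backbone = resnet50(weights=None)
        self.sbn = nn.Sequential(*list(backbone.children())[:-2])  # (B, 2048, H/32, W/32)
        # 2) SRFF stand-in: parallel dilated convolutions gather multi-scale context.
        self.srff = nn.ModuleList([
            nn.Conv2d(2048, d_model, 3, padding=d, dilation=d) for d in (1, 2, 4)
        ])
        # 3) SGFE stand-in: channel gating followed by spatial gating.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(d_model, d_model // 8, 1), nn.ReLU(),
            nn.Conv2d(d_model // 8, d_model, 1), nn.Sigmoid())
        self.spatial_gate = nn.Sequential(nn.Conv2d(d_model, 1, 7, padding=3), nn.Sigmoid())
        # 4) CCD stand-in: a Transformer decoder that maps fused features to word logits.
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.ccd = nn.TransformerDecoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab_size)

    def encode_one(self, img):
        f = self.sbn(img)
        f = sum(conv(f) for conv in self.srff)    # multi-scale fusion
        f = f * self.channel_gate(f)              # channel-wise re-weighting
        f = f * self.spatial_gate(f)              # spatial re-weighting
        return f.flatten(2).transpose(1, 2)       # (B, HW, d_model) token sequence

    def forward(self, img_t1, img_t2, caption_tokens):
        # Shared (Siamese) encoding of both time steps, concatenated as decoder memory.
        memory = torch.cat([self.encode_one(img_t1), self.encode_one(img_t2)], dim=1)
        tgt = self.embed(caption_tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(
            caption_tokens.size(1)).to(caption_tokens.device)
        return self.head(self.ccd(tgt, memory, tgt_mask=mask))  # (B, T, vocab_size)

# Example forward pass with dummy bi-temporal images and a partial caption.
model = SFENSketch()
t1, t2 = torch.randn(2, 3, 256, 256), torch.randn(2, 3, 256, 256)
tokens = torch.randint(0, 5000, (2, 12))
logits = model(t1, t2, tokens)  # -> torch.Size([2, 12, 5000])
```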