ABSTRACT Road extraction from UAV images is hampered by variable lighting, noise, occlusions, and non-road objects that resemble roads, all of which make high-quality extraction difficult. To address these issues, this study proposes an enhanced U-Net network that automates the extraction and 3D modeling of real road scenes from UAV imagery. First, a cascaded atrous spatial pyramid module is integrated into the encoder to exploit global context information and thereby refine blurry segmentation results. Second, a module that strengthens road feature extraction is added along the channel dimension, and a spatial attention mechanism is introduced in the decoder to sharpen road edges. Experimental results show that the model captures more road information than mainstream networks and effectively incorporates topological structure perception in complex scenarios, which improves road connectivity. On UAV images of road scenes, the model achieves an F1 score of 85.6% and a mean Intersection over Union (mIoU) of 81.2%, improvements of 3.9% and 3.4%, respectively, over the traditional U-Net model. Finally, the model supports refined modeling and visual analysis of road scenes, with high overall accuracy and detailed local restoration of the actual scene.
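For readers unfamiliar with the two reported metrics, the following minimal sketch shows one common way to compute F1 and mIoU for a binary road/background segmentation. It is an illustration under stated assumptions, not the evaluation code used in this study: the function names (`confusion_counts`, `f1_score`, `mean_iou`) are hypothetical, masks are assumed to be flattened sequences of 0/1 labels, and mIoU is assumed to be the unweighted mean of the road-class and background-class IoU.

```python
from typing import Sequence, Tuple


def confusion_counts(pred: Sequence[int], truth: Sequence[int]) -> Tuple[int, int, int, int]:
    """Count (TP, FP, FN, TN) for flattened binary masks, where 1 = road."""
    tp = fp = fn = tn = 0
    for p, t in zip(pred, truth):
        if p and t:
            tp += 1
        elif p and not t:
            fp += 1
        elif not p and t:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def f1_score(pred: Sequence[int], truth: Sequence[int]) -> float:
    """F1 for the road class: 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn, _ = confusion_counts(pred, truth)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mean_iou(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Assumed definition: mean of road IoU and background IoU."""
    tp, fp, fn, tn = confusion_counts(pred, truth)
    road_union = tp + fp + fn
    background_union = tn + fp + fn
    road_iou = tp / road_union if road_union else 0.0
    background_iou = tn / background_union if background_union else 0.0
    return (road_iou + background_iou) / 2


# Toy example with six pixels (hypothetical data, not from the study).
pred = [1, 1, 0, 0, 1, 0]
truth = [1, 0, 0, 0, 1, 1]
print(f1_score(pred, truth), mean_iou(pred, truth))
```

In this toy example there are two true positives, one false positive, one false negative, and two true negatives, so F1 is 4/6 and both class IoUs are 0.5.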