Road extraction from remote sensing images has advanced significantly. However, fully supervised road segmentation networks require large numbers of densely labeled road samples, which limits their applicability in large-scale scenarios. Consequently, semi-supervised methods that need less labeled data have gained increasing attention. Yet the imbalance between a small quantity of labeled data and a large volume of unlabeled data leads to both local detail errors and overall cognitive errors in semi-supervised road extraction. To address this challenge, this paper proposes a novel consistency self-training semi-supervised method (CSSnet) that learns effectively from a limited number of labeled samples and a large amount of unlabeled data. The method integrates self-training semi-supervised segmentation with semi-supervised classification. The segmentation component relies on an enhanced generative adversarial network for semantic segmentation, which significantly reduces local detail errors. The classification component relies on an upgraded mean-teacher network to handle overall cognitive errors. As a result, the method performs well with only a modest amount of labeled data. We validated the method on three separate road datasets comprising high-resolution satellite remote sensing images and UAV photographs. In these experiments, it consistently outperformed state-of-the-art semi-supervised methods and several classic fully supervised methods.
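For readers unfamiliar with the mean-teacher idea mentioned above, the following is a minimal sketch of the standard scheme only: a teacher whose weights are an exponential moving average (EMA) of the student's, plus a consistency loss between their predictions. It is not the upgraded network used in CSSnet; the function names, the flat weight lists, and the default `alpha` are assumptions chosen for illustration.

```python
def ema_update(teacher, student, alpha=0.99):
    """Return new teacher weights as an EMA of the student weights.

    Hypothetical helper: weights are flat lists of floats, and
    alpha is an assumed smoothing factor, not a value from the paper.
    """
    return [alpha * t + (1.0 - alpha) * s for t, s in zip(teacher, student)]


def consistency_loss(student_probs, teacher_probs):
    """Mean squared error between student and teacher predictions
    on the same (typically unlabeled) input."""
    n = len(student_probs)
    return sum((s - t) ** 2 for s, t in zip(student_probs, teacher_probs)) / n


# The teacher moves slowly toward the student after each training step.
teacher = ema_update([1.0, 0.0], [0.0, 1.0], alpha=0.9)

# Identical predictions incur no consistency penalty.
loss = consistency_loss([0.5, 0.5], [0.5, 0.5])
```

In this generic scheme only the student is trained by gradient descent; the teacher is updated solely through the moving average, which tends to give more stable targets on unlabeled data.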