ABSTRACT Height map estimation from a single aerial image plays a crucial role in localization, mapping, and 3D object detection. Deep convolutional neural networks have been used to predict height information from single-view remote sensing images, but these methods rely on large volumes of training data and often overlook the geometric features present in orthographic images. To address these issues, this study proposes a gradient-based self-supervised learning network with a momentum contrastive loss that extracts geometric information from unlabeled images during pretraining. In addition, novel local implicit constraint layers are applied at multiple decoding stages of the proposed supervised network to refine high-resolution features for height estimation. A structure-aware loss is also introduced to improve the network's robustness to positional shifts and minor structural changes along boundary areas. Experimental evaluation on the ISPRS benchmark datasets shows that the proposed method outperforms the baseline networks, achieving minimum MAE and RMSE of 0.116 and 0.289 on the Vaihingen dataset and 0.077 and 0.481 on the Potsdam dataset, respectively. The proposed method also achieves an approximately threefold improvement in data efficiency on the Potsdam dataset and demonstrates domain generalization on the Enschede dataset. These results confirm the effectiveness of the proposed method for height map estimation from single-view remote sensing images.