Rethinking Monocular Height Estimation From a Classification Task Perspective Leveraging the Vision Transformer

Wenbo Sun, Yifan Liao, Mingchun Lin, Zhi Gao, Biao Yang, Yichen Zhang, Ruifang Zhai

doi:10.1109/lgrs.2022.3222457

Abstract

Height estimation from a single remote sensing image has great potential for efficiently generating digital surface models (DSMs) for rapid earth surface reconstruction. Recently, convolutional neural networks (CNNs) have emerged as a powerful method for this ill-posed problem. Most existing methods formulate height estimation as a regression problem because object height is continuous. However, it is difficult for a model to regress object heights exactly to ground-truth values that span a wide range. In this letter, we reformulate height estimation as a classification task to improve model performance. Specifically, we discretize the continuous ground-truth height into bins and assign each pixel a single label according to the bin subdivision. In addition, we propose to adaptively generate a unique bin subdivision for each input image by treating bin generation as a set-to-set problem. Compared with a fixed bin subdivision, an image-specific bin subdivision lets the model focus on the height range that is more likely to occur in the scene of the input image. Our experiments demonstrate, both qualitatively and quantitatively, that the proposed method outperforms state-of-the-art approaches on both the Vaihingen and Potsdam datasets.
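
The discretization step described in the abstract can be illustrated with a minimal sketch. It assumes uniform, fixed bin edges, which corresponds to the fixed-subdivision baseline rather than the adaptive, per-image bins proposed in the letter. The function names, the height range of 0 to 20, and the choice of four bins are hypothetical and chosen only for illustration; recovering a height from a label by taking the bin center is likewise an assumption, not necessarily the decoding used by the authors.

```python
import numpy as np

def make_uniform_bin_edges(h_min, h_max, num_bins):
    # Hypothetical helper: num_bins equal-width bins need num_bins + 1 edges.
    return np.linspace(h_min, h_max, num_bins + 1)

def heights_to_labels(height_map, bin_edges):
    # Assign each pixel the index of the bin its height falls into.
    # Using only the interior edges keeps labels in [0, num_bins - 1],
    # so out-of-range heights land in the first or last bin.
    return np.digitize(height_map, bin_edges[1:-1])

def labels_to_heights(labels, bin_edges):
    # Assumed decoding: map each class label back to its bin center.
    centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])
    return centers[labels]

# Toy 2x2 ground-truth height map (illustrative values only).
height_map = np.array([[0.0, 2.4], [7.5, 19.9]])
edges = make_uniform_bin_edges(0.0, 20.0, 4)   # edges at 0, 5, 10, 15, 20
labels = heights_to_labels(height_map, edges)  # per-pixel class labels
recovered = labels_to_heights(labels, edges)   # quantized heights
```

With fixed bins, the quantization error is bounded by half the bin width everywhere, regardless of the scene. An adaptive subdivision, as proposed in the letter, would instead supply a different `edges` array for each input image.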

Full Text