Abstract

Currently available thermal image depth estimation methods struggle to efficiently extract fine multi-scale feature information from thermal images and suffer from blurred details at the edges of the estimated depth map. To address these challenges, this paper proposes MSDFNet, a multi-scale detail feature fusion encoder–decoder network for self-supervised monocular thermal image depth estimation. The model is built on a lightweight channel-expansion hourglass residual feature encoder, which captures rich, fine-grained multi-scale feature information at low computational cost. MSDFNet utilizes a detail feature weight evaluation decoder to fuse cross-scale features and reevaluate the importance of each feature, thereby emphasizing critical edge information at multiple scales. Additionally, MSDFNet incorporates a depth consistency loss function, which provides self-supervision signals for the detail features of thermal images and improves network optimization. Evaluated on the ViViD++ and MS2 datasets, the method achieves state-of-the-art depth estimation performance. In the Indoor Dark scenario of the ViViD++ dataset, the Abs Rel, Sq Rel, RMSE, and RMSE log error metrics of MSDFNet are reduced by 6.71%, 11.92%, 9.09%, and 5.73%, respectively, while the accuracy metrics δ < 1.25^i (i = 1, 2, 3) are improved by 4.18%, 1.13%, and 0.2%, respectively. MSDFNet also demonstrates excellent generalization ability on the MS2 dataset: in the night scenario, the Abs Rel and RMSE errors are reduced by 45.6% and 30.09%, respectively, and the accuracies δ < 1.25^i (i = 1, 3) are improved by 20.95% and 1.33%, respectively; in the rainy scenario, the Abs Rel and RMSE errors are reduced by 1.33% and 1.21%, respectively, and the accuracies δ < 1.25^i (i = 1, 3) are improved by 0.24% and 0.83%, respectively.
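The error and accuracy figures quoted above use the standard monocular depth evaluation measures (Abs Rel, Sq Rel, RMSE, RMSE log, and the δ < 1.25^i threshold accuracies). For reference, the following is a minimal NumPy sketch of how these metrics are conventionally computed; the function name `depth_metrics` and the assumption that `gt` and `pred` are aligned, strictly positive depth arrays (in meters) are illustrative, not taken from the paper.

```python
import numpy as np

def depth_metrics(gt, pred):
    """Standard monocular depth estimation metrics.

    gt, pred: aligned arrays of valid, strictly positive depths.
    Returns Abs Rel, Sq Rel, RMSE, RMSE log, and the three
    threshold accuracies delta < 1.25^i for i = 1, 2, 3.
    """
    # Threshold accuracy: fraction of pixels whose depth ratio
    # max(gt/pred, pred/gt) falls below 1.25^i.
    ratio = np.maximum(gt / pred, pred / gt)
    a1 = (ratio < 1.25).mean()
    a2 = (ratio < 1.25 ** 2).mean()
    a3 = (ratio < 1.25 ** 3).mean()

    # Error metrics.
    abs_rel = np.mean(np.abs(gt - pred) / gt)            # Abs Rel
    sq_rel = np.mean((gt - pred) ** 2 / gt)              # Sq Rel
    rmse = np.sqrt(np.mean((gt - pred) ** 2))            # RMSE
    rmse_log = np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))  # RMSE log

    return {"abs_rel": abs_rel, "sq_rel": sq_rel, "rmse": rmse,
            "rmse_log": rmse_log, "a1": a1, "a2": a2, "a3": a3}
```

Note that lower values are better for the four error metrics, while higher values are better for the δ < 1.25^i accuracies, which is why the abstract reports reductions for the former and improvements for the latter.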