Recent learning-based multi-view stereo (MVS) approaches have shown excellent performance. These approaches typically train a deep neural network to estimate dense depth maps from multiple images. However, most of them require large-scale dense depth maps as supervisory signals during training. This paper proposes a self-supervised learning framework for MVS that learns to estimate dense depth maps from multiple images without dense depth supervision. Taking an arbitrary number of images as input, we produce sparse depth maps using structure from motion and use them as self-supervision. In regions where no sparse depth is available, we apply reconstruction and smoothness losses. For stable training, we introduce a pseudo-depth loss, defined as the difference between the depth maps estimated by the network with its current parameters and with its past parameters. Experimental results on multiple datasets demonstrate the effectiveness of our self-supervised learning framework.
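
The way these loss terms might combine can be illustrated with a minimal NumPy sketch. It is not the implementation used in this work: the L1 form of each term, the weights `w_rec`, `w_smooth`, and `w_pseudo`, the convention that a sparse depth of 0 means "no measurement", and the application of the pseudo-depth term to every pixel are all assumptions made for illustration. The function name and its arguments are hypothetical, and `warped_image` is assumed to be a source view already warped into the reference view using the predicted depth.

```python
import numpy as np


def masked_mean(values, mask):
    """Mean of `values` over pixels where `mask` is True; 0.0 if the mask is empty."""
    count = mask.sum()
    return float((values * mask).sum() / count) if count > 0 else 0.0


def self_supervised_loss(depth, sparse_depth, ref_image, warped_image, past_depth,
                         w_rec=1.0, w_smooth=0.1, w_pseudo=0.5):
    """Combine the four terms named in the abstract (all weights are assumed).

    All inputs are 2-D arrays of the same shape (grayscale images, H x W).
    """
    # Assumption: pixels without a structure-from-motion depth are stored as 0.
    has_sparse = sparse_depth > 0
    no_sparse = ~has_sparse

    # Sparse-depth term: only where structure from motion produced a depth.
    sparse_term = masked_mean(np.abs(depth - sparse_depth), has_sparse)

    # Reconstruction term: photometric error, only where no sparse depth exists.
    rec_term = masked_mean(np.abs(ref_image - warped_image), no_sparse)

    # Smoothness term: absolute forward differences of the predicted depth,
    # padded with the last column/row so the shape is unchanged.
    dx = np.abs(np.diff(depth, axis=1, append=depth[:, -1:]))
    dy = np.abs(np.diff(depth, axis=0, append=depth[-1:, :]))
    smooth_term = masked_mean(dx + dy, no_sparse)

    # Pseudo-depth term: difference between the current prediction and the
    # prediction made with past network parameters (assumed over all pixels).
    pseudo_term = float(np.mean(np.abs(depth - past_depth)))

    return sparse_term + w_rec * rec_term + w_smooth * smooth_term + w_pseudo * pseudo_term
```

In this sketch the sparse-depth mask and its complement partition the image, so each pixel is supervised either by a structure-from-motion depth or by the reconstruction and smoothness terms, never both.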