In recent years, unsupervised multi-view stereo (MVS) methods have achieved great success, producing results comparable to earlier supervised work. However, because unsupervised MVS relies on image reconstruction as its pretext task, it suffers from two key drawbacks: RGB values, the quantities being compared across views, are not robust under complicated conditions such as varying illumination, and the reconstruction error does not reflect the quality of depth estimation linearly. These problems cause the actual optimization objective to diverge from the intended one and can therefore impair training. To make the pretext task more robust, we propose a contrastive learning based constraint that enforces featuremetric consistency across views by pulling together the features of matching points and pushing apart the features of unmatched points. To make the overall training objective behave more linearly, we propose a multi-stage training strategy that switches to pseudo-label supervision after an initial unsupervised stage. In addition, we adopt an iterative optimizer, which has proven highly effective in supervised MVS, to accelerate training. Finally, we conduct a series of experiments on the DTU and Tanks and Temples datasets that demonstrate the efficiency and robustness of our method compared with state-of-the-art methods in terms of accuracy, completeness, and speed.
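The abstract does not spell out the exact form of the contrastive constraint, but the idea of making matched features similar and unmatched features dissimilar can be illustrated with a minimal InfoNCE-style sketch. The function name, tensor shapes, temperature value, and the choice of cross-entropy over positive/negative similarities below are all assumptions for illustration, not the paper's stated formulation.

```python
import torch
import torch.nn.functional as F

def featuremetric_contrastive_loss(ref_feat, src_feat, pos_idx, neg_idx, temperature=0.07):
    """Hypothetical cross-view featuremetric contrastive constraint.

    ref_feat: (N, C) features sampled at reference-view pixels.
    src_feat: (M, C) features sampled at source-view pixels.
    pos_idx:  (N,)   index into src_feat of each pixel's reprojected match.
    neg_idx:  (N, K) indices of K unmatched source pixels used as negatives.
    """
    ref = F.normalize(ref_feat, dim=-1)
    src = F.normalize(src_feat, dim=-1)

    # Similarity to the matching point (should be high) ...
    pos = (ref * src[pos_idx]).sum(dim=-1, keepdim=True)        # (N, 1)
    # ... and to unmatched points (should be low).
    neg = torch.einsum('nc,nkc->nk', ref, src[neg_idx])         # (N, K)

    # Treat the match as class 0 among (1 + K) candidates.
    logits = torch.cat([pos, neg], dim=1) / temperature
    target = torch.zeros(ref.size(0), dtype=torch.long, device=ref.device)
    return F.cross_entropy(logits, target)
```

Under this reading, the loss rewards reference features that are closer to their cross-view matches than to any sampled non-match, which is one common way to realize the "similar matched / opposite unmatched" behavior described above.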