Abstract

We propose a novel learning-based method for multi-view stereo (MVS) depth estimation capable of recovering depth from images taken from known, but unconstrained, views. Existing MVS methods extract features from each image independently before projecting them onto a set of planes at candidate depths to compute matching costs. By projecting features only after extraction, these networks must learn rotation- and scale-invariant representations even though the relative poses of the cameras are known. In our approach, we compensate for viewpoint changes directly in the extraction layers, allowing the network to learn features that are projected by construction and reducing the need for rotation and scale invariance. Compensating for viewpoint changes naively, however, can be computationally expensive, as the feature layers must either be applied multiple times (once per depth hypothesis) or replaced by 3D convolutions. We overcome this limitation in two ways. First, we compute our matching cost volume only at a coarse image scale before upsampling and refining the outputs. Second, we incrementally compute our projected features so that the bulk of the layers need only be executed a single time across all depth hypotheses. The combination of these two techniques allows our method to perform competitively with the state of the art while being significantly faster. We call our method MultiViewStereoNet and release our source code publicly for the benefit of the robotics community.
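
For a concrete reference point, the following is a minimal sketch of the plane-sweep projection step that the abstract describes in existing MVS methods: features extracted from a source view are warped onto fronto-parallel planes at candidate depths in the reference frame, once per depth hypothesis. This is a generic illustration under stated assumptions, not the authors' implementation; the helper name `warp_to_reference`, the tensor shapes, and the single shared intrinsic matrix `K` are all illustrative choices.

```python
# Sketch of standard plane-sweep feature warping (the "extract then
# project" baseline). Assumes PyTorch; all names and shapes are
# illustrative, not taken from the MultiViewStereoNet code release.
import torch
import torch.nn.functional as F

def warp_to_reference(src_feat, K, R, t, depths):
    """Warp source-view features onto fronto-parallel planes in the
    reference frame, one warp per depth hypothesis.

    src_feat: (C, H, W) features from the source image.
    K:        (3, 3) camera intrinsics (assumed shared by both views).
    R, t:     rotation (3, 3) and translation (3,), reference -> source.
    depths:   (D,) candidate plane depths in the reference frame.
    Returns:  (D, C, H, W) warped features, one slice per hypothesis.
    """
    C, H, W = src_feat.shape
    # Pixel grid of the reference image in homogeneous coordinates.
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32),
                            indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(3, -1)

    K_inv = torch.linalg.inv(K)
    n = torch.tensor([0.0, 0.0, 1.0])  # normal of the fronto-parallel planes
    warped = []
    for d in depths:
        # Plane-induced homography: H(d) = K (R - t n^T / d) K^{-1}.
        Hmat = K @ (R - torch.outer(t, n) / d) @ K_inv
        q = Hmat @ pix                       # reference pixels -> source image
        q = q[:2] / q[2:].clamp(min=1e-6)    # perspective divide
        # Normalize coordinates to [-1, 1] for grid_sample.
        gx = 2.0 * q[0] / (W - 1) - 1.0
        gy = 2.0 * q[1] / (H - 1) - 1.0
        grid = torch.stack([gx, gy], dim=-1).reshape(1, H, W, 2)
        warped.append(F.grid_sample(src_feat[None], grid,
                                    align_corners=True)[0])
    return torch.stack(warped)  # (D, C, H, W)
```

In this baseline, the feature extractor runs once per image but the warp (and any cost computation built on it) runs D times, and moving the compensation into the extraction layers naively would multiply the extractor cost by D as well; this is the expense that the paper's two techniques, the coarse-scale cost volume and the incremental feature computation, are designed to avoid.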
