Cascade Cost Volume for High-Resolution Multi-View Stereo and Stereo Matching
Deep multi-view stereo (MVS) and stereo matching approaches generally construct 3D cost volumes to regularize and regress the output depth or disparity. These methods are limited when high-resolution outputs are needed, since the memory and time costs grow cubically as the volume resolution increases. In this paper, we propose a cost volume formulation that is both memory- and time-efficient and is complementary to existing multi-view stereo and stereo matching approaches based on 3D cost volumes. First, the proposed cost volume is built upon a standard feature pyramid encoding geometry and context at gradually finer scales. Then, we narrow the depth (or disparity) range of each stage using the depth (or disparity) map from the previous stage. With gradually higher cost volume resolution and adaptive adjustment of depth (or disparity) intervals, the output is recovered in a coarse-to-fine manner. We apply the cascade cost volume to the representative MVSNet and obtain a 35.6% improvement on the DTU benchmark (1st place), with 50.6% and 59.3% reductions in GPU memory and run-time, respectively. It is also the state-of-the-art learning-based method on the Tanks and Temples benchmark. The statistics of accuracy, run-time and GPU memory on other representative stereo CNNs also validate the effectiveness of our proposed method. Our source code is available at https://github.com/alibaba/cascade-stereo.
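The range narrowing described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the two-stage schedule, the image sizes, and the depth range of 400 to 1000 are all hypothetical values chosen only to show why a coarse full-range volume followed by a fine narrow-range volume needs far fewer cells than a single fine full-range volume.

```python
def uniform_hypotheses(d_min, d_max, num_planes):
    """Evenly spaced depth hypotheses covering the full range (coarse stage)."""
    step = (d_max - d_min) / (num_planes - 1)
    return [d_min + i * step for i in range(num_planes)]


def narrowed_hypotheses(prev_depth, num_planes, interval):
    """Hypotheses centred on the previous stage's estimate for one pixel."""
    half = (num_planes - 1) / 2.0
    return [prev_depth + (i - half) * interval for i in range(num_planes)]


def volume_cells(width, height, num_planes):
    """Number of cells in a width x height x planes cost volume."""
    return width * height * num_planes


# Hypothetical numbers, for illustration only.
coarse = uniform_hypotheses(400.0, 1000.0, 49)        # interval 12.5
fine = narrowed_hypotheses(prev_depth=650.0, num_planes=9, interval=2.5)

# One full-resolution volume sampling the whole range at the fine interval...
single_cells = volume_cells(640, 512, 241)
# ...versus a quarter-resolution coarse volume plus a narrow fine volume.
cascade_cells = volume_cells(160, 128, 49) + volume_cells(640, 512, 9)

print(fine[0], fine[-1])
print(single_cells, cascade_cells)
```

In this toy setting the fine stage searches only a band of width 20 around the coarse estimate, so the final depth resolution matches the single-volume case while the total cell count is much smaller.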
- Research Article
4
- 10.1109/access.2023.3273903
- Jan 1, 2025
- IEEE Access
Extensive studies have been conducted on multi-view stereo and stereo matching for 3D reconstruction, whereas relatively few methods have been proposed for large-scale environments. The difficulty of producing high-resolution depth/disparity maps is one of the main reasons. In this paper, we propose a dual attention-guided self-adaptive aware cascade network (DAscNet) that achieves state-of-the-art results for generating high-resolution depth/disparity maps of complex scenes by introducing a cascade inference strategy using a set of input views. A pyramid cost volume fusion and a self-adaptive cost volume cascade are built upon a dual attention-guided context multi-scale feature extraction that encodes geometric, spatial and contextual information at gradually finer scales to achieve a robust structural representation for predictions. The dual attention-guided context multi-scale feature extraction is made up of two distinct modules that are both based on the attention mechanism. In the pyramid cost volume fusion, an inter-cost attention aggregation module fuses multiple low-resolution dense cost volumes to achieve a robust structural representation for initial predictions. In the self-adaptive cost volume cascade, a changeable depth/disparity range estimation module is employed to alter the depth/disparity search range interval of the following stage based on the prediction information from the previous stage. This module drives the network to gradually resolve complicated matching ambiguities and improves the accuracy of the predicted depth/disparity search range interval. Experiments on two publicly available datasets, the Tanks and Temples dataset and the DTU dataset, show that DAscNet outperforms prior work. The effectiveness of our proposed method is also supported by statistics on the accuracy, runtime, and GPU memory of other representative methods.
- Research Article
7
- 10.3390/s24082400
- Apr 9, 2024
- Sensors (Basel, Switzerland)
With the widespread adoption of modern RGB cameras, an abundance of RGB images is available everywhere. Therefore, multi-view stereo (MVS) 3D reconstruction has been extensively applied across various fields because of its cost-effectiveness and accessibility, which involves multi-view depth estimation and stereo matching algorithms. However, MVS tasks face noise challenges because of natural multiplicative noise and negative gain in algorithms, which reduce the quality and accuracy of the generated models and depth maps. Traditional MVS methods often struggle with noise, relying on assumptions that do not always hold true under real-world conditions, while deep learning-based MVS approaches tend to suffer from high noise sensitivity. To overcome these challenges, we introduce LNMVSNet, a deep learning network designed to enhance local feature attention and fuse features across different scales, aiming for low-noise, high-precision MVS 3D reconstruction. Through extensive evaluation of multiple benchmark datasets, LNMVSNet has demonstrated its superior performance, showcasing its ability to improve reconstruction accuracy and completeness, especially in the recovery of fine details and clear feature delineation. This advancement brings hope for the widespread application of MVS, ranging from precise industrial part inspection to the creation of immersive virtual environments.
- Research Article
5
- 10.6688/jise.2015.31.1.7
- Jan 1, 2015
- Journal of Information Science and Engineering
This paper presents a stochastic optimization based 3D dense reconstruction from multiple views. Accuracy and completeness are two major measure indices for performance evaluation of various multi-view stereo (MVS) algorithms. First, the reconstruction accuracy is highly related to the stereo mismatches over the multiple views. Stereo mismatches occur in the image regions involving the lack of texture, depth discontinuity, or repeated texture patterns. Second, an insufficient number of views or occlusions between objects also lead to the difficulty in matching so that the reconstruction completeness degrades. In pursuit of high accuracy and completeness we present the appropriate techniques to solve the above problems in the reconstruction task. To deal with the various stereo mismatch problems we propose to apply adaptive matching functions and allow partial matching. We shall model the object to be reconstructed by a set of 3D oriented planar patches covering the visible object surface. The adopted multi-view reconstruction is formulated as a patch expansion process under a tree hierarchy. In order to find the optimal patches via multi-view stereo matching we shall employ a PSO (Particle Swarm Optimization) method for the sake of implementation simplicity and avoidance of possible local traps as found in the derivative based optimization methods. The success in the PSO method relies on imposing proper constraints on ranges of the patch parameters including the patch depth and patch normal vector which are involved in the PSO objective function (i.e., the stereo matching function). Furthermore, we use a varying patch size to obtain the reliable patches in the areas containing less texture, repeated texture pattern, or depth discontinuity. To secure a high reconstruction quality we advocate a patch priority queue to select the best patch during the patch expansion. 
All of the above mentioned techniques are also effective in the situations when the number of views is sparse or the camera baseline width is wide. The proposed method is tested on synthetic and real image data sets. The experimental results indicate that the proposed method is superior or comparable to the top ranked reconstruction methods reported in the public Middlebury MVS evaluation website.
- Research Article
123
- 10.1609/aaai.v34i07.6939
- Apr 3, 2020
- Proceedings of the AAAI Conference on Artificial Intelligence
Deep learning has been shown to be effective for depth inference in multi-view stereo (MVS). However, scalability and accuracy remain open problems in this domain. This can be attributed to the memory-consuming cost volume representation and inappropriate depth inference. Inspired by the group-wise correlation in stereo matching, we propose an average group-wise correlation similarity measure to construct a lightweight cost volume. This not only reduces the memory consumption but also reduces the computational burden of cost volume filtering. Based on our effective cost volume representation, we propose a cascade 3D U-Net module to regularize the cost volume and further boost the performance. Unlike previous methods that treat multi-view depth inference as a depth regression problem or an inverse depth classification problem, we recast multi-view depth inference as an inverse depth regression task. This allows our network to achieve sub-pixel estimation and be applicable to large-scale scenes. Through extensive experiments on the DTU dataset and the Tanks and Temples dataset, we show that our proposed network with Correlation cost volume and Inverse DEpth Regression (CIDER) achieves state-of-the-art results, demonstrating its superior performance on scalability and accuracy.
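One plausible reading of "inverse depth regression" is sketched below: hypotheses are spaced uniformly in inverse depth, a softmax turns per-plane scores into probabilities, and the expected inverse depth is inverted to give a continuous depth. This is an assumption made for illustration, not the CIDER code; the function names and the toy depth range of 1 to 4 are hypothetical.

```python
import math


def inverse_depth_hypotheses(d_min, d_max, num_planes):
    """Hypotheses spaced uniformly in 1/depth, ordered from far to near."""
    inv_near, inv_far = 1.0 / d_min, 1.0 / d_max
    step = (inv_near - inv_far) / (num_planes - 1)
    return [inv_far + i * step for i in range(num_planes)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def regress_depth(scores, inv_depths):
    """Expected inverse depth under the softmax, converted back to depth."""
    probs = softmax(scores)
    expected_inv = sum(p * q for p, q in zip(probs, inv_depths))
    return 1.0 / expected_inv


inv = inverse_depth_hypotheses(1.0, 4.0, 4)   # inverse depths 0.25 .. 1.0
print(regress_depth([0.0, 50.0, 0.0, 0.0], inv))  # close to depth 2
```

Because the expectation is taken over a continuous quantity, the result can fall between two hypothesis planes, which is how a regression formulation yields sub-plane (sub-pixel) estimates. Uniform spacing in inverse depth also places more planes near the camera, where depth changes affect the image most.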
- Research Article
8
- 10.3389/feart.2023.1108403
- Apr 13, 2023
- Frontiers in Earth Science
Introduction: Stereo matching of satellite imagery is an important way to reconstruct the real world. Most stereo matching technologies for satellite imagery are based on deep learning. However, existing deep learning-based methods suffer from holes and matching errors in stereo matching tasks. Methods: To improve satellite image stereo matching results, we propose a satellite image stereo matching network based on the attention mechanism (A-SATMVSNet). To address insufficient extraction of surface features, we propose a new feature extraction module based on triple dilated convolution with an attention module, which resolves the matching holes caused by insufficient extraction of surface features. In addition, in contrast to the traditional weighted average method, we design a novel cost-volume method that integrates the attention mechanism to reduce the impact of matching errors and improve matching accuracy. Results and discussion: Experiments on a public multi-view stereo matching dataset based on satellite imagery demonstrate that the proposed method significantly improves the accuracy and outperforms various previous methods. Our source code is available at https://github.com/MVSer/A-SATMVSNet.
- Dissertation
- 10.14711/thesis-991012786067603412
- Jan 1, 2019
Multi-view stereo (MVS) reconstructs 3D representations of the scene from imagery, which is a core problem of computer vision extensively studied for decades. Traditionally, MVS algorithms apply hand-crafted similarity metrics and engineered regularizations to compute dense correspondences. While these methods have shown great results under ideal Lambertian scenarios, classical MVS algorithms still suffer from numerous artifacts. In this thesis, we propose to advance the MVS reconstruction using recent deep learning techniques. First, we present an end-to-end deep learning architecture, MVSNet, for depth map inference from multi-view images. The key contribution of this part is the careful integration between multi-view geometries and convolutional neural networks (CNNs). In the network, we extract deep image features and build the 3D cost volume upon the camera frustum via the differentiable homography warping. Then, 3D convolutions are applied to regularize and regress the output depth map. We demonstrate on DTU dataset that MVSNet significantly outperforms previous state-of-the-arts in both reconstruction completeness and overall quality. Next, we propose to extend the MVSNet architecture for large-scale MVS reconstruction. One major limitation of current learning-based approaches is the scalability: the memory-consuming cost volume regularization makes the learned MVS hard to be applied to high-resolution scenes. To this end, we sequentially regularize 2D cost maps via the gated recurrent unit (GRU) rather than regularize the entire 3D cost volume in one go. The GRU regularization dramatically reduces memory consumption and makes high-resolution reconstructions feasible. The proposed R-MVSNet is evaluated on the large-scale Tanks and Temples dataset and achieves comparable results to classical large-scale MVS algorithms. Finally, we establish a large-scale synthetic MVS dataset, BlendedMVS, based on blended images and rendered depth maps. 
While several MVS datasets have been proposed, they fail to provide accurate depth and occlusion information because ground truth mesh models are usually incomplete. We therefore establish a new MVS dataset based on model rendering. Textured meshes are first reconstructed from images of different scenes and are then rendered into color images, depth maps and occlusion maps. We further blend rendered images with input images using high-pass and low-pass filters to generate our training input. Extensive experiments demonstrate that models trained on BlendedMVS achieve significantly better generalization ability compared with models trained on other MVS datasets. In sum, this thesis presents a complete learning-based solution to large-scale multi-view stereopsis, including a current baseline network (MVSNet), its large-scale extension (R-MVSNet) and a large-scale synthetic dataset (BlendedMVS). We bridge the gap between classical MVS reconstructions and recent deep learning techniques and demonstrate the effectiveness of the learning-based MVS through extensive experiments on different datasets.
- Conference Article
113
- 10.1109/cvpr42600.2020.00609
- Jun 1, 2020
A great deal of research has demonstrated recently that multi-view stereo (MVS) matching can be solved with deep learning methods. However, these efforts were focused on close-range objects, and only a very few of the deep learning-based methods were specifically designed for large-scale 3D urban reconstruction, due to the lack of multi-view aerial image benchmarks. In this paper, we present a synthetic aerial dataset, called the WHU dataset, which we created for MVS tasks and which, to our knowledge, is the first large-scale multi-view aerial dataset. It was generated from a highly accurate 3D digital surface model produced from thousands of real aerial images with precise camera parameters. We also introduce a novel network, called RED-Net, for wide-range depth inference, which we developed from a recurrent encoder-decoder structure that regularizes cost maps across depths and a 2D fully convolutional network as the framework. RED-Net's low memory requirements and high performance make it suitable for large-scale and highly accurate 3D Earth surface reconstruction. Our experiments confirmed that our method exceeded the current state-of-the-art MVS methods not only in accuracy, by more than 50% in mean absolute error (MAE) with less memory and computational cost, but also in efficiency. It outperformed one of the best commercial software programs based on conventional methods, improving on its efficiency 16 times over. Moreover, we showed that our RED-Net model pre-trained on the synthetic WHU dataset can be efficiently transferred to very different multi-view aerial image datasets without any fine-tuning. Dataset and code are available at http://gpcv.whu.edu.cn/data.
- Book Chapter
37
- 10.1007/978-3-031-19821-2_42
- Jan 1, 2022
We address multiview stereo (MVS), an important 3D vision task that reconstructs a 3D model such as a dense point cloud from multiple calibrated images. We propose CER-MVS (Cascaded Epipolar RAFT Multiview Stereo), a new approach based on the RAFT (Recurrent All-Pairs Field Transforms) architecture developed for optical flow. CER-MVS introduces five new changes to RAFT: epipolar cost volumes, cost volume cascading, multiview fusion of cost volumes, dynamic supervision, and multiresolution fusion of depth maps. CER-MVS is significantly different from prior work in multiview stereo. Unlike prior work, which operates by updating a 3D cost volume, CER-MVS operates by updating a disparity field. Furthermore, we propose an adaptive thresholding method to balance the completeness and accuracy of the reconstructed point clouds. Experiments show that our approach achieves state-of-the-art performance on the DTU and Tanks-and-Temples benchmarks (both intermediate and advanced set). Code is available at https://github.com/princeton-vl/CER-MVS. Keywords: Multiview stereo
- Research Article
1
- 10.1371/journal.pone.0314418
- Feb 13, 2025
- PloS one
This paper introduces a multi-view stereo matching network, the Multi-Step Depth Enhancement Refine Network (MSDER-MVS), aimed at improving the accuracy and computational efficiency of high-resolution 3D reconstruction. The MSDER-MVS network combines the capabilities of modern deep learning with the geometric intuition of traditional 3D reconstruction techniques, with a particular focus on optimizing the quality of the depth map and the efficiency of the reconstruction process. Our key innovations include a dual-branch fusion structure and a Feature Pyramid Network (FPN) to effectively extract and integrate multi-scale features. With this approach, we construct depth maps progressively from coarse to fine, continuously improving depth prediction accuracy at each refinement stage. For cost volume construction, we employ a variance-based metric to integrate information from multiple perspectives, optimizing the consistency of the estimates. Moreover, we introduce a differentiable depth optimization process that iteratively enhances the quality of depth estimation using residuals and the Jacobian matrix, without the need for additional learnable parameters. This innovation significantly increases the network's convergence rate and the fineness of depth prediction. Extensive experiments on the standard DTU dataset (Aanæs H, 2016) show that MSDER-MVS surpasses current advanced methods in accuracy, completeness, and overall performance metrics. Particularly in scenarios rich in detail, our method more precisely recovers surface details and textures, demonstrating its effectiveness and superiority for practical applications. Overall, the MSDER-MVS network offers a robust solution for precise and efficient 3D scene reconstruction. 
Looking forward, we aim to extend this approach to more complex environments and larger-scale datasets, further enhancing the model's generalization and real-time processing capabilities, and promoting the widespread deployment of multi-view stereo matching technology in practical applications.
- Research Article
22
- 10.1609/aaai.v37i3.25368
- Jun 26, 2023
- Proceedings of the AAAI Conference on Artificial Intelligence
Self-supervised monocular methods can efficiently learn depth information of weakly textured surfaces or reflective objects. However, the depth accuracy is limited due to the inherent ambiguity in monocular geometric modeling. In contrast, multi-frame depth estimation methods improve depth accuracy thanks to the success of Multi-View Stereo (MVS), which directly makes use of geometric constraints. Unfortunately, MVS often suffers from texture-less regions, non-Lambertian surfaces, and moving objects, especially in real-world video sequences without known camera motion and depth supervision. Therefore, we propose MOVEDepth, which exploits the MOnocular cues and VElocity guidance to improve multi-frame Depth learning. Unlike existing methods that enforce consistency between MVS depth and monocular depth, MOVEDepth boosts multi-frame depth learning by directly addressing the inherent problems of MVS. The key of our approach is to utilize monocular depth as a geometric priority to construct MVS cost volume, and adjust depth candidates of cost volume under the guidance of predicted camera velocity. We further fuse monocular depth and MVS depth by learning uncertainty in the cost volume, which results in a robust depth estimation against ambiguity in multi-view geometry. Extensive experiments show MOVEDepth achieves state-of-the-art performance: Compared with Monodepth2 and PackNet, our method relatively improves the depth accuracy by 20% and 19.8% on the KITTI benchmark. MOVEDepth also generalizes to the more challenging DDAD benchmark, relatively outperforming ManyDepth by 7.2%. The code is available at https://github.com/JeffWang987/MOVEDepth.
- Research Article
27
- 10.1007/s40747-023-01106-3
- Jun 7, 2023
- Complex & Intelligent Systems
Deep learning has recently been proven to deliver excellent performance in multi-view stereo (MVS). However, it is difficult for deep learning-based MVS approaches to balance their efficiency and effectiveness. Towards this end, we propose the DSC-MVSNet, a novel coarse-to-fine and end-to-end framework for more efficient and more accurate depth estimation in MVS. In particular, we propose an attention aware 3D UNet-shape network, which first uses the depthwise separable convolutions for cost volume regularization. This mechanism enables effective aggregation of information and significantly reduces the model parameters and computation by transforming the ordinary convolution on cost volume as depthwise convolution and pointwise convolution. Besides, a 3D-Attention module is proposed to alleviate the feature mismatching problem in cost volume regularization and aggregate the important information of cost volume in three dimensions (i.e. channel, space, and depth). Moreover, we propose an efficient Feature Transfer Module to upsample the low-resolution (LR) depth map to a high-resolution (HR) depth map to achieve higher accuracy. With extensive experiments on two benchmark datasets, i.e. DTU and Tanks & Temples, we demonstrate that the parameters of our model are significantly reduced to 25% of the state-of-the-art model MVSNet. Besides, our method outperforms or maintains on par accuracy with the state-of-the-art models. Our source code is available at https://github.com/zs670980918/DSC-MVSNet.
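The saving from replacing an ordinary 3D convolution with a depthwise convolution followed by a pointwise convolution can be checked with simple arithmetic. The sketch below counts weights only (biases ignored) for a single layer; the kernel size and channel counts are hypothetical, and the whole-model reduction reported in the abstract depends on the full architecture, not on this one-layer figure.

```python
def standard_conv3d_params(kernel, c_in, c_out):
    """Weights in an ordinary 3D convolution: k^3 * C_in * C_out."""
    return kernel ** 3 * c_in * c_out


def separable_conv3d_params(kernel, c_in, c_out):
    """Depthwise (k^3 per input channel) plus pointwise (1x1x1) weights."""
    depthwise = kernel ** 3 * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


# Hypothetical layer: 3x3x3 kernel, 32 input and 32 output channels.
standard = standard_conv3d_params(3, 32, 32)
separable = separable_conv3d_params(3, 32, 32)
print(standard, separable, separable / standard)
```

The ratio simplifies to 1/C_out + 1/k^3, so the saving grows with both the number of output channels and the kernel size.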
- Research Article
156
- 10.1109/tpami.2023.3296163
- Jan 1, 2023
- IEEE Transactions on Pattern Analysis and Machine Intelligence
Learning-based multi-view stereo (MVS) has by far centered around 3D convolution on cost volumes. Due to the high computation and memory consumption of 3D CNN, the resolution of output depth is often considerably limited. Different from most existing works dedicated to adaptive refinement of cost volumes, we opt to directly optimize the depth value along each camera ray, mimicking the range (depth) finding of a laser scanner. This reduces the MVS problem to ray-based depth optimization which is much more light-weight than full cost volume optimization. In particular, we propose RayMVSNet which learns sequential prediction of a 1D implicit field along each camera ray with the zero-crossing point indicating scene depth. This sequential modeling, conducted based on transformer features, essentially learns the epipolar line search in traditional multi-view stereo. We devise a multi-task learning for better optimization convergence and depth accuracy. We found the monotonicity property of the SDFs along each ray greatly benefits the depth estimation. Our method ranks top on both the DTU and the Tanks & Temples datasets over all previous learning-based methods, achieving an overall reconstruction score of 0.33 mm on DTU and an F-score of 59.48% on Tanks & Temples. It is able to produce high-quality depth estimation and point cloud reconstruction in challenging scenarios such as objects/scenes with non-textured surface, severe occlusion, and highly varying depth range. Further, we propose RayMVSNet++ to enhance contextual feature aggregation for each ray through designing an attentional gating unit to select semantically relevant neighboring rays within the local frustum around that ray. This improves the performance on datasets with more challenging examples (e.g., low-quality images caused by poor lighting conditions or motion blur). RayMVSNet++ achieves state-of-the-art performance on the ScanNet dataset. 
In particular, it attains an AbsRel of 0.058m and produces accurate results on the two subsets of textureless regions and large depth variation.
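The idea that a zero-crossing of a 1D implicit field along a camera ray marks the scene depth can be illustrated without any learning. The sketch below assumes the field is sampled at known depths and is positive in front of the surface and negative behind it (a sign convention chosen here for illustration); it finds the first sign change and interpolates linearly. It does not reproduce RayMVSNet's learned sequential prediction, and the function name is hypothetical.

```python
def zero_crossing_depth(depths, field_values):
    """Depth of the first positive-to-negative crossing along one ray.

    depths:       increasing sample depths along the ray
    field_values: signed field value at each sample (assumed > 0 in front
                  of the surface, < 0 behind it)
    Returns None when no crossing is found.
    """
    for i in range(len(depths) - 1):
        a, b = field_values[i], field_values[i + 1]
        if a == 0.0:
            return depths[i]
        if a > 0.0 and b <= 0.0:
            t = a / (a - b)  # fraction of the interval where the line hits zero
            return depths[i] + t * (depths[i + 1] - depths[i])
    return None


print(zero_crossing_depth([1.0, 2.0, 3.0, 4.0], [1.5, 0.5, -0.5, -1.5]))
```

A field that decreases monotonically along the ray has at most one such crossing, which makes this search unambiguous and is consistent with the abstract's remark that monotonicity of the SDF benefits depth estimation.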
- Conference Article
128
- 10.1109/cvpr52688.2022.00840
- Jun 1, 2022
Learning-based multi-view stereo (MVS) has by far centered around 3D convolution on cost volumes. Due to the high computation and memory consumption of 3D CNN, the resolution of output depth is often considerably limited. Different from most existing works dedicated to adaptive refinement of cost volumes, we opt to directly optimize the depth value along each camera ray, mimicking the range (depth) finding of a laser scanner. This reduces the MVS problem to ray-based depth optimization which is much more light-weight than full cost volume optimization. In particular, we propose RayMVSNet which learns sequential prediction of a 1D implicit field along each camera ray with the zero-crossing point indicating scene depth. This sequential modeling, conducted based on transformer features, essentially learns the epipolar line search in traditional multi-view stereo. We also devise a multi-task learning for better optimization convergence and depth accuracy. Our method ranks top on both the DTU and the Tanks & Temples datasets over all previous learning-based methods, achieving an overall reconstruction score of 0.33 mm on DTU and an F-score of 59.48% on Tanks & Temples.
- Research Article
27
- 10.1016/j.patcog.2022.109198
- Nov 22, 2022
- Pattern Recognition
Prior depth-based multi-view stereo network for online 3D model reconstruction
- Research Article
7
- 10.3390/s22155500
- Jul 23, 2022
- Sensors (Basel, Switzerland)
While recent deep learning-based stereo-matching networks have shown outstanding advances, there are still some unsolved challenges. First, most state-of-the-art stereo models employ 3D convolutions for 4D cost volume aggregation, which limits the deployment of networks in resource-limited mobile environments owing to heavy consumption of computation and memory. Although there are some efficient networks, most of them still require too much computation to run on mobile computing devices in real time. Second, most stereo networks indirectly supervise cost volumes through a disparity regression loss by using the soft-argmax function. This causes problems in ambiguous regions, such as the boundaries of objects, because many unreasonable cost distributions are possible, which results in overfitting. A few works deal with this problem by generating an artificial cost distribution using only the ground truth disparity value, which is insufficient to fully regularize the cost volume. To address these problems, we first propose an efficient multi-scale sequential feature fusion network (MSFFNet). Specifically, we connect multi-scale SFF modules in parallel with a cross-scale fusion function to generate a set of cost volumes with different scales. These cost volumes are then effectively combined using the proposed interlaced concatenation method. Second, we propose an adaptive cost-volume-filtering (ACVF) loss function that directly supervises our estimated cost volume. The proposed ACVF loss directly adds constraints to the cost volume using the probability distribution generated from the ground truth disparity map and that estimated from the teacher network, which achieves higher accuracy. Results of several experiments using representative datasets for stereo matching show that our proposed method is more efficient than previous methods. 
Our network architecture uses fewer parameters and generates reasonable disparity maps faster than existing state-of-the-art stereo models. Concretely, our network achieves 1.01 EPE with a runtime of 42 ms, 2.92 M parameters, and 97.96 G FLOPs on the Scene Flow test set. Compared with PSMNet, our method is 89% faster and 7% more accurate with 45% fewer parameters.
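The weakness of indirect supervision through soft-argmax that this abstract points to can be seen in a small example. Soft-argmax returns the probability-weighted mean of the candidate disparities, so very different per-disparity distributions can produce the same regressed value. The sketch below is a generic illustration with hypothetical scores, not the MSFFNet or ACVF code.

```python
import math


def soft_argmax(scores):
    """Expected disparity index under a softmax over per-disparity scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return sum(d * e / total for d, e in enumerate(exps))


# A single sharp peak at disparity 6.
unimodal = [0.0] * 13
unimodal[6] = 20.0

# Two equally strong peaks at disparities 2 and 10 (e.g. at an object edge).
bimodal = [0.0] * 13
bimodal[2] = 20.0
bimodal[10] = 20.0

print(soft_argmax(unimodal), soft_argmax(bimodal))
```

Both distributions regress to a disparity of about 6, so a loss on the regressed value alone cannot tell them apart, even though the bimodal one assigns almost no probability to disparity 6. Supervising the distribution itself, as the abstract proposes, removes that ambiguity.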