LNMVSNet: A Low-Noise Multi-View Stereo Depth Inference Method for 3D Reconstruction
With the widespread adoption of modern RGB cameras, RGB images are abundantly available. Multi-view stereo (MVS) 3D reconstruction, which involves multi-view depth estimation and stereo matching algorithms, has therefore been applied extensively across various fields because of its cost-effectiveness and accessibility. However, MVS tasks face noise challenges because of natural multiplicative noise and negative gain in algorithms, which reduce the quality and accuracy of the generated models and depth maps. Traditional MVS methods often struggle with noise, relying on assumptions that do not always hold under real-world conditions, while deep learning-based MVS approaches tend to suffer from high noise sensitivity. To overcome these challenges, we introduce LNMVSNet, a deep learning network designed to enhance local feature attention and fuse features across different scales, aiming for low-noise, high-precision MVS 3D reconstruction. In extensive evaluation on multiple benchmark datasets, LNMVSNet demonstrates superior performance, improving reconstruction accuracy and completeness, especially in the recovery of fine details and clear feature delineation. This advancement supports the wider application of MVS, ranging from precise industrial part inspection to the creation of immersive virtual environments.
- Conference Article
867
- 10.1109/cvpr42600.2020.00257
- Jun 1, 2020
Deep multi-view stereo (MVS) and stereo matching approaches generally construct 3D cost volumes to regularize and regress the output depth or disparity. These methods are limited when high-resolution outputs are needed, since the memory and time costs grow cubically as the volume resolution increases. In this paper, we propose a memory- and time-efficient cost volume formulation that is complementary to existing multi-view stereo and stereo matching approaches based on 3D cost volumes. First, the proposed cost volume is built upon a standard feature pyramid encoding geometry and context at gradually finer scales. Then, we narrow the depth (or disparity) range of each stage using the depth (or disparity) map from the previous stage. With gradually higher cost volume resolution and adaptive adjustment of depth (or disparity) intervals, the output is recovered in a coarse-to-fine manner. We apply the cascade cost volume to the representative MVS-Net, and obtain a 35.6% improvement on the DTU benchmark (1st place), with 50.6% and 59.3% reductions in GPU memory and run-time. It is also the state-of-the-art learning-based method on the Tanks and Temples benchmark. The statistics of accuracy, run-time and GPU memory on other representative stereo CNNs also validate the effectiveness of our proposed method. Our source code is available at https://github.com/alibaba/cascade-stereo.
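The coarse-to-fine narrowing described in this abstract can be illustrated for a single pixel. The sketch below is a toy and not the released cascade-stereo code: the function names, the plane counts (49, 33, 9), the halving of the interval per stage and the absolute-difference stand-in for a matching cost are all assumptions chosen for illustration.

```python
def sample_planes(center, num_planes, interval):
    """Return num_planes depth hypotheses spaced by `interval`, centred on `center`."""
    half_span = (num_planes - 1) * interval / 2.0
    return [center - half_span + i * interval for i in range(num_planes)]


def cascade_depth(cost_fn, d_min, d_max, planes=(49, 33, 9), ratio=0.5):
    """Pick the lowest-cost hypothesis per stage, then narrow the search range."""
    interval = (d_max - d_min) / (planes[0] - 1)  # stage 0 spans the full range
    center = (d_min + d_max) / 2.0
    history = []
    for stage, num_planes in enumerate(planes):
        if stage > 0:
            interval *= ratio  # finer spacing around the previous estimate
        hypotheses = sample_planes(center, num_planes, interval)
        center = min(hypotheses, key=cost_fn)
        history.append((num_planes, interval, center))
    return center, history


true_depth = 612.3  # hypothetical ground-truth depth for one pixel
estimate, history = cascade_depth(lambda d: abs(d - true_depth), 425.0, 935.0)
for num_planes, interval, depth in history:
    print(f"{num_planes:3d} planes, interval {interval:8.4f}, estimate {depth:.4f}")
```

Under these assumed settings the three stages evaluate 91 hypotheses in total, whereas a single uniform sweep of the same range at the final interval would need 193.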
- Dissertation
- 10.14711/thesis-991012786067603412
- Jan 1, 2019
Multi-view stereo (MVS) reconstructs 3D representations of the scene from imagery, which is a core problem of computer vision extensively studied for decades. Traditionally, MVS algorithms apply hand-crafted similarity metrics and engineered regularizations to compute dense correspondences. While these methods have shown great results under ideal Lambertian scenarios, classical MVS algorithms still suffer from numerous artifacts. In this thesis, we propose to advance the MVS reconstruction using recent deep learning techniques. First, we present an end-to-end deep learning architecture, MVSNet, for depth map inference from multi-view images. The key contribution of this part is the careful integration between multi-view geometries and convolutional neural networks (CNNs). In the network, we extract deep image features and build the 3D cost volume upon the camera frustum via the differentiable homography warping. Then, 3D convolutions are applied to regularize and regress the output depth map. We demonstrate on the DTU dataset that MVSNet significantly outperforms the previous state of the art in both reconstruction completeness and overall quality. Next, we propose to extend the MVSNet architecture for large-scale MVS reconstruction. One major limitation of current learning-based approaches is scalability: the memory-consuming cost volume regularization makes learned MVS hard to apply to high-resolution scenes. To this end, we sequentially regularize 2D cost maps via the gated recurrent unit (GRU) rather than regularize the entire 3D cost volume in one go. The GRU regularization dramatically reduces memory consumption and makes high-resolution reconstructions feasible. The proposed R-MVSNet is evaluated on the large-scale Tanks and Temples dataset and achieves comparable results to classical large-scale MVS algorithms. Finally, we establish a large-scale synthetic MVS dataset, BlendedMVS, based on blended images and rendered depth maps.
While several MVS datasets have been proposed, they fail to provide accurate depth and occlusion information as ground truth mesh models are usually incomplete. We therefore establish a new MVS dataset based on model rendering. Textured meshes are first reconstructed from images of different scenes, which are then rendered into color images, depth maps and occlusion maps. We further blend rendered images with input images using high-pass and low-pass filters to generate our training input. Extensive experiments demonstrate that models trained on BlendedMVS achieve significantly better generalization ability compared with models trained on other MVS datasets. In sum, this thesis presents a complete learning-based solution to large-scale multi-view stereopsis, including a current baseline network (MVSNet), its large-scale extension (R-MVSNet) and a large-scale synthetic dataset (BlendedMVS). We bridge the gap between classical MVS reconstructions and recent deep learning techniques and demonstrate the effectiveness of the learning-based MVS through extensive experiments on different datasets.
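The step in which a regularized cost volume is regressed to a depth value is commonly done with a soft argmin, that is, the expectation of depth under a softmax over negated costs. The following per-pixel sketch illustrates that idea only; the function name and the toy cost values are assumptions, and it is not taken from the thesis code.

```python
import math


def soft_argmin_depth(costs, depths):
    """Expected depth under a softmax over negated matching costs."""
    lowest = min(costs)  # subtract the minimum for numerical stability
    weights = [math.exp(-(c - lowest)) for c in costs]
    total = sum(weights)
    return sum(w / total * d for w, d in zip(weights, depths))


depths = [1.0, 2.0, 3.0]  # hypothetical depth hypotheses for one pixel
costs = [5.0, 0.0, 5.0]   # hypothetical regularized costs, lowest at depth 2.0
expected_depth = soft_argmin_depth(costs, depths)
print(expected_depth)
```

Unlike a hard argmin, this expectation is differentiable with respect to the costs and can fall between two sampled depth planes.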
- Research Article
4
- 10.1109/access.2023.3273903
- Jan 1, 2025
- IEEE Access
Extensive studies have been conducted on multi-view stereo and stereo matching for 3D reconstruction, whereas relatively few methods have been proposed for large-scale environments. The difficulty of producing high-resolution depth/disparity maps is one of the main reasons. In this paper, we propose a dual attention-guided self-adaptive aware cascade network (DAscNet) that achieves state-of-the-art results for generating high-resolution depth/disparity maps of complex scenes by introducing a cascade inference strategy using a set of input views. A pyramid cost volume fusion and a self-adaptive cost volume cascade are built upon a dual attention-guided context multi-scale feature extraction encoding geometric, spatial and contextual information at gradually finer scales to achieve robust structural representation for predictions. The dual attention-guided context multi-scale feature extraction is made up of two distinct modules that are both based on the attention mechanism. In the pyramid cost volume fusion, an inter-cost attention aggregation module fuses multiple low-resolution dense cost volumes to achieve a robust structural representation for initial predictions. In the self-adaptive cost volume cascade, a changeable depth/disparity range estimation module is employed to alter the depth/disparity search range interval of the following stage based on the prediction information from the previous stage. This module can drive the network to gradually deal with complicated matching ambiguities and improve the accuracy of the predicted depth/disparity search range interval. Experiments on two publicly available datasets, the Tanks and Temples dataset and the DTU dataset, show that DAscNet outperforms prior work. The effectiveness of our proposed method is also supported by statistics on the accuracy, runtime, and GPU memory of other representative methods.
- Research Article
32
- 10.1088/1757-899x/1073/1/012066
- Feb 1, 2021
- IOP Conference Series: Materials Science and Engineering
With the development of the Information and Computer Technology (ICT) sector, three-dimensional (3D) technology is also growing rapidly. Currently, 3D object visualization is widely used in animation and graphics applications, architecture, education, cultural recognition and Virtual Reality. 3D modeling of historic buildings has become a concern in recent years. 3D reconstruction is an attempt to document a building so that it can be reconstructed or restored if it is destroyed. By reconstructing 3D models using the Structure from Motion (SFM) and Multi View Stereo (MVS) algorithms from computer vision, it is hoped that the resulting models can be used to help preserve 3D objects in the Penataran Temple cultural heritage area. This research was conducted by taking 61 images of objects in the Penataran Temple area in Blitar. The photos obtained were reconstructed into a 3D model using the Structure from Motion algorithm in Meshroom. In this research, both the original images and compressed images are used for reconstruction in order to compare the 3D reconstruction process for the two kinds of input data. From the 61 images processed using the Structure from Motion algorithm, 33 camera poses and 3D points were improved, for both the original and compressed images. The compressed images required 1.4% fewer iterations than the original images, and their processing was 43.53% faster.
- Research Article
1
- 10.1371/journal.pone.0314418
- Feb 13, 2025
- PloS one
This paper introduces an innovative multi-view stereo matching network, the Multi-Step Depth Enhancement Refine Network (MSDER-MVS), aimed at improving the accuracy and computational efficiency of high-resolution 3D reconstruction. The MSDER-MVS network leverages the potent capabilities of modern deep learning in conjunction with the geometric intuition of traditional 3D reconstruction techniques, with a particular focus on optimizing the quality of the depth map and the efficiency of the reconstruction process. Our key innovations include a dual-branch fusion structure and a Feature Pyramid Network (FPN) to effectively extract and integrate multi-scale features. With this approach, we construct depth maps progressively from coarse to fine, continuously improving depth prediction accuracy at each refinement stage. For cost volume construction, we employ a variance-based metric to integrate information from multiple perspectives, optimizing the consistency of the estimates. Moreover, we introduce a differentiable depth optimization process that iteratively enhances the quality of depth estimation using residuals and the Jacobian matrix, without the need for additional learnable parameters. This innovation significantly increases the network's convergence rate and the fineness of depth prediction. Extensive experiments on the standard DTU dataset (Aanas H, 2016) show that MSDER-MVS surpasses current advanced methods in accuracy, completeness, and overall performance metrics. Particularly in scenarios rich in detail, our method more precisely recovers surface details and textures, demonstrating its effectiveness and superiority for practical applications. Overall, the MSDER-MVS network offers a robust solution for precise and efficient 3D scene reconstruction.
Looking forward, we aim to extend this approach to more complex environments and larger-scale datasets, further enhancing the model's generalization and real-time processing capabilities, and promoting the widespread deployment of multi-view stereo matching technology in practical applications.
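The variance-based metric mentioned in this abstract can be pictured as follows: for one pixel and one depth hypothesis, the features warped from each view are compared channel by channel, and low variance across views indicates a consistent match. This is a minimal sketch under that reading; the function name and the toy feature vectors are assumptions, not the MSDER-MVS implementation.

```python
def variance_cost(view_features):
    """Per-channel variance of feature vectors across views (population variance)."""
    num_views = len(view_features)
    num_channels = len(view_features[0])
    means = [sum(f[c] for f in view_features) / num_views for c in range(num_channels)]
    return [
        sum((f[c] - means[c]) ** 2 for f in view_features) / num_views
        for c in range(num_channels)
    ]


# Hypothetical two-channel features sampled from three views at one depth hypothesis.
consistent = [[0.5, 1.0], [0.5, 1.0], [0.5, 1.0]]
inconsistent = [[1.0, 0.0], [3.0, 0.0]]
print(variance_cost(consistent))    # identical features give zero variance
print(variance_cost(inconsistent))  # the first channel disagrees across views
```

A practical property of this metric is that it accepts any number of input views without changing the shape of the resulting cost.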
- Research Article
6
- 10.5194/isprs-archives-xlviii-1-w3-2023-123-2023
- Oct 19, 2023
- The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences
Abstract. 3D reconstruction from single and multi-view stereo images is still an open research topic, despite the high number of solutions proposed in the last decades. The surge of deep learning methods has stimulated the development of new methods using monocular (MDE, Monocular Depth Estimation), stereoscopic and Multi-View Stereo (MVS) 3D reconstruction, showing promising results, often comparable to or even better than traditional methods. The more recent development of NeRF (Neural Radiance Fields) has further triggered interest in this kind of solution. Most of the proposed approaches, however, focus on terrestrial applications (e.g., autonomous driving or 3D reconstruction of small artefacts), while airborne and UAV acquisitions are often overlooked. The recent introduction of new datasets, such as UseGeo, has therefore given the opportunity to assess how state-of-the-art MDE, MVS and NeRF 3D reconstruction algorithms perform using airborne UAV images, allowing their comparison with LiDAR ground truth. This paper presents the results achieved by two MDE, two MVS and two NeRF approaches leveraging deep learning, trained and tested using the UseGeo dataset. This work allows comparison with a ground truth, showing the current state of the art of these solutions and providing useful indications for their future development and improvement.
- Research Article
1
- 10.3390/s24196397
- Oct 2, 2024
- Sensors (Basel, Switzerland)
With FaSS-MVS, we present a fast, surface-aware semi-global optimization approach for multi-view stereo that allows for rapid depth and normal map estimation from monocular aerial video data captured by unmanned aerial vehicles (UAVs). The data estimated by FaSS-MVS, in turn, facilitate online 3D mapping, meaning that a 3D map of the scene is immediately and incrementally generated as the image data are acquired or being received. FaSS-MVS is composed of a hierarchical processing scheme in which depth and normal data, as well as corresponding confidence scores, are estimated in a coarse-to-fine manner, allowing efficient processing of large scene depths, such as those inherent in oblique images acquired by UAVs flying at low altitudes. The actual depth estimation uses a plane-sweep algorithm for dense multi-image matching to produce depth hypotheses from which the actual depth map is extracted by means of a surface-aware semi-global optimization, reducing the fronto-parallel bias of Semi-Global Matching (SGM). Given the estimated depth map, the pixel-wise surface normal information is then computed by reprojecting the depth map into a point cloud and computing the normal vectors within a confined local neighborhood. In a thorough quantitative and ablative study, we show that the accuracy of the 3D information computed by FaSS-MVS is close to that of state-of-the-art offline multi-view stereo approaches, with the error not even an order of magnitude higher than that of COLMAP. At the same time, however, the average runtime of FaSS-MVS for estimating a single depth and normal map is less than 14% of that of COLMAP, allowing us to perform online and incremental processing of full HD images at 1-2 Hz.
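The normal estimation step described here (reproject the depth map into 3D points, then compute normals from a local neighborhood) can be sketched with a pinhole camera model. The version below uses only the right and lower neighbors and a cross product, which is one simple choice of neighborhood and not necessarily the one used by FaSS-MVS; the function names and intrinsics are assumptions.

```python
import math


def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with the given depth to a 3D point in the camera frame."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def normal_at(depth_map, u, v, fx, fy, cx, cy):
    """Unit surface normal at (u, v) from the right and lower neighbouring pixels."""
    p = backproject(u, v, depth_map[v][u], fx, fy, cx, cy)
    right = backproject(u + 1, v, depth_map[v][u + 1], fx, fy, cx, cy)
    down = backproject(u, v + 1, depth_map[v + 1][u], fx, fy, cx, cy)
    a = tuple(r - q for r, q in zip(right, p))
    b = tuple(d - q for d, q in zip(down, p))
    n = (a[1] * b[2] - a[2] * b[1],
         a[2] * b[0] - a[0] * b[2],
         a[0] * b[1] - a[1] * b[0])
    length = math.sqrt(sum(c * c for c in n))
    n = tuple(c / length for c in n)
    # Flip the normal so that it faces the camera located at the origin.
    if sum(nc * pc for nc, pc in zip(n, p)) > 0:
        n = tuple(-c for c in n)
    return n


flat = [[2.0] * 3 for _ in range(3)]  # fronto-parallel plane at depth 2
normal = normal_at(flat, 1, 1, fx=500.0, fy=500.0, cx=1.0, cy=1.0)
print(normal)
```

For a fronto-parallel plane the result points straight back along the optical axis, which gives a quick sanity check for the sign convention.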
- Research Article
2
- 10.1016/j.ophoto.2025.100089
- Apr 1, 2025
- ISPRS Open Journal of Photogrammetry and Remote Sensing
Image-based 3D reconstruction offers realistic scene representation for applications that require accurate geometric information. Although the assumption that images are simultaneously captured, perfectly posed and noise-free simplifies the 3D reconstruction, this rarely holds in real-world settings. A real-world scene comprises multiple objects that obstruct each other, so certain object parts are occluded and it can be challenging to generate a complete and accurate geometry. We are particularly interested in vegetation, which is part of our environment and often obscures important structures, leading to incomplete reconstruction of the underlying features. In this contribution, we present a comparative analysis of the geometry behind vegetation occlusions reconstructed by traditional Multi-View Stereo (MVS) and radiance field methods, namely: Neural Radiance Fields (NeRFs), 3D Gaussian Splatting (3DGS) and 2D Gaussian Splatting (2DGS). Excluding certain image parts and investigating how different levels of vegetation occlusion affect the geometric reconstruction, we consider Synthetic masks with different occlusion coverage of 10% (Very Sparse), 30% (Sparse), 50% (Medium), 70% (Dense) and 90% (Very Dense). To additionally demonstrate the impact of spatially consistent 3D occlusions, we use Natural masks (up to 35%) where the vegetation is stationary in the 3D scene, but relative to the view-point. Our investigations are based on real-world scenarios: one occlusion-free indoor scenario, on which we apply the Synthetic masks, and one outdoor scenario, from which we derive the Natural masks. The qualitative and quantitative 3D evaluation is based on point cloud comparison against a ground truth mesh addressing accuracy and completeness.
The conducted experiments and results demonstrate that although MVS shows lowest accuracy errors in both scenarios, the completeness manifests a sharp decline as the occlusion percentage increases, eventually failing under Very Dense masks. NeRFs manifest robustness in the reconstruction with highest completeness considering masks, although the accuracy proportionally decreases with increasing the occlusions. 2DGS achieves second best accuracy results outperforming NeRFs and 3DGS, indicating a consistent performance across different occlusion scenarios. Additionally, by using MVS for initialization, 3DGS and 2DGS completeness improves without significantly sacrificing the accuracy, due to the more densely reconstructed homogeneous areas. We demonstrate that radiance field methods can compete against traditional MVS, showing robust performance for a complete reconstruction under vegetation occlusions. • Vegetation occlusions for 3D reconstruction with MVS, NeRFs and GS. • 2DGS achieves second best accuracy results outperforming NeRFs. • Comprehensive qualitative and quantitative 3D evaluation.
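Accuracy and completeness, as used in this kind of evaluation, are often computed as nearest-neighbor distances in the two directions between the reconstruction and the reference. The sketch below assumes both are given as point lists (the study compares against a ground truth mesh, which would first have to be sampled) and uses a brute-force search; the function names are illustrative only.

```python
import math


def mean_nearest_distance(source, target):
    """Mean distance from each source point to its nearest target point."""
    return sum(min(math.dist(s, t) for t in target) for s in source) / len(source)


def accuracy_completeness(reconstruction, ground_truth):
    """Accuracy: reconstruction to reference. Completeness: reference to reconstruction."""
    accuracy = mean_nearest_distance(reconstruction, ground_truth)
    completeness = mean_nearest_distance(ground_truth, reconstruction)
    return accuracy, completeness


# Hypothetical toy data: the reconstruction recovers only one of two reference points.
reconstruction = [(0.0, 0.0, 0.0)]
ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
accuracy, completeness = accuracy_completeness(reconstruction, ground_truth)
print(accuracy, completeness)
```

The toy case shows why the two numbers can diverge under occlusion: every reconstructed point is exact, so the accuracy error is zero, while the missing point raises the completeness error.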
- Conference Article
1
- 10.54941/ahfe1003624
- Jan 1, 2023
- AHFE international
Multi-view stereo (MVS) 3D reconstruction based on deep learning has achieved great success; however, it requires datasets of very high quality and quantity compared with other computer vision tasks. Current 3D datasets have great limitations in the reconstruction of industrial products, including low accuracy, few types of styles, and few pairwise image models. In this paper, we introduce a new dataset for MVS 3D model reconstruction, focusing on the watch wristband category. Compared with the existing open-source watch and wristband datasets, ours contains more than 1k multi-view high-resolution images and high-precision 3D models, covering cartoon, mechanical, vintage and other styles. Most importantly, ours can be used directly for deep learning-based MVS 3D reconstruction because, besides three views of real images, we drew line sketches of the three views and then matched them to the high-precision 3D models one by one. Finally, we train a deep learning-based MVS network with our dataset as input and supervision. The experiments show that we achieve significant results and verify the effectiveness of reconstruction in the watch wristband category.
- Research Article
128
- 10.1016/j.displa.2021.102102
- Oct 9, 2021
- Displays
Multi-view stereo in the Deep Learning Era: A comprehensive review
- Research Article
27
- 10.1016/j.tust.2023.105345
- Aug 2, 2023
- Tunnelling and Underground Space Technology
A low-cost 3D reconstruction and measurement system based on structure-from-motion (SFM) and multi-view stereo (MVS) for sewer pipelines
- Research Article
5
- 10.6688/jise.2015.31.1.7
- Jan 1, 2015
- Journal of Information Science and Engineering
This paper presents a stochastic optimization based 3D dense reconstruction from multiple views. Accuracy and completeness are two major measure indices for performance evaluation of various multi-view stereo (MVS) algorithms. First, the reconstruction accuracy is highly related to the stereo mismatches over the multiple views. Stereo mismatches occur in the image regions involving the lack of texture, depth discontinuity, or repeated texture patterns. Second, an insufficient number of views or occlusions between objects also lead to the difficulty in matching so that the reconstruction completeness degrades. In pursuit of high accuracy and completeness we present the appropriate techniques to solve the above problems in the reconstruction task. To deal with the various stereo mismatch problems we propose to apply adaptive matching functions and allow partial matching. We shall model the object to be reconstructed by a set of 3D oriented planar patches covering the visible object surface. The adopted multi-view reconstruction is formulated as a patch expansion process under a tree hierarchy. In order to find the optimal patches via multi-view stereo matching we shall employ a PSO (Particle Swarm Optimization) method for the sake of implementation simplicity and avoidance of possible local traps as found in the derivative based optimization methods. The success in the PSO method relies on imposing proper constraints on ranges of the patch parameters including the patch depth and patch normal vector which are involved in the PSO objective function (i.e., the stereo matching function). Furthermore, we use a varying patch size to obtain the reliable patches in the areas containing less texture, repeated texture pattern, or depth discontinuity. To secure a high reconstruction quality we advocate a patch priority queue to select the best patch during the patch expansion. 
All of the above mentioned techniques are also effective in the situations when the number of views is sparse or the camera baseline width is wide. The proposed method is tested on synthetic and real image data sets. The experimental results indicate that the proposed method is superior or comparable to the top ranked reconstruction methods reported in the public Middlebury MVS evaluation website.
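To make the role of Particle Swarm Optimization and the bounded parameter ranges concrete, the sketch below runs a basic PSO over a single bounded parameter standing in for patch depth. The quadratic cost is a placeholder for a stereo matching function, and the swarm size, coefficients and names are assumptions rather than the settings used in the paper.

```python
import random


def pso_minimize(cost, lower, upper, particles=20, iterations=100, seed=0,
                 inertia=0.7, cognitive=1.5, social=1.5):
    """Minimise a 1D cost over [lower, upper] with a basic particle swarm."""
    rng = random.Random(seed)
    pos = [rng.uniform(lower, upper) for _ in range(particles)]
    vel = [0.0] * particles
    best_pos = list(pos)                  # best position seen by each particle
    best_cost = [cost(x) for x in pos]
    g = min(range(particles), key=lambda i: best_cost[i])
    g_pos, g_cost = best_pos[g], best_cost[g]  # best position seen by the swarm
    for _ in range(iterations):
        for i in range(particles):
            vel[i] = (inertia * vel[i]
                      + cognitive * rng.random() * (best_pos[i] - pos[i])
                      + social * rng.random() * (g_pos - pos[i]))
            # Clamp to the allowed range, mirroring the range constraints above.
            pos[i] = min(max(pos[i] + vel[i], lower), upper)
            c = cost(pos[i])
            if c < best_cost[i]:
                best_pos[i], best_cost[i] = pos[i], c
                if c < g_cost:
                    g_pos, g_cost = pos[i], c
    return g_pos, g_cost


# Placeholder matching cost with its minimum at depth 3.0 (hypothetical).
best_depth, best_cost_value = pso_minimize(lambda d: (d - 3.0) ** 2, 0.0, 10.0)
print(best_depth, best_cost_value)
```

Because the update needs only cost evaluations and no derivatives, the same loop applies when the matching function is non-smooth; a full patch search would extend the position to include the normal vector as well.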
- Research Article
5
- 10.1016/j.cag.2024.103954
- Jun 8, 2024
- Computers & Graphics
Multi-view depth estimation based on multi-feature aggregation for 3D reconstruction
- Supplementary Content
6
- 10.3390/s25185748
- Sep 15, 2025
- Sensors (Basel, Switzerland)
Three-dimensional (3D) reconstruction technology is not only a core technology in computer vision and graphics, but also a key force driving the development of many cutting-edge applications such as virtual reality (VR), augmented reality (AR), autonomous driving, and digital earth. With the rise of novel view synthesis technologies such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS), 3D reconstruction is facing unprecedented development opportunities. This article introduces the basic principles of traditional 3D reconstruction methods, including Structure from Motion (SfM) and Multi View Stereo (MVS) techniques, and analyzes the limitations of these methods in dealing with complex scenes and dynamic environments. Focusing on implicit 3D scene reconstruction techniques related to NeRF, this paper explores the advantages and challenges of using deep neural networks to learn and generate high-quality 3D scene rendering from limited perspectives. Based on the principles and characteristics of 3DGS-related technologies that have emerged in recent years, the latest progress and innovations in rendering quality, rendering efficiency, sparse view input support, and dynamic 3D reconstruction are analyzed. Finally, the main challenges and opportunities faced by current 3D reconstruction and novel view synthesis technology are discussed in depth, along with possible future technological breakthroughs and development directions. This article aims to provide a comprehensive perspective for researchers in 3D reconstruction technology in fields such as digital twins and smart cities, while opening up new ideas and paths for future technological innovation and widespread application.
- Research Article
46
- 10.1016/j.aei.2023.102196
- Sep 28, 2023
- Advanced Engineering Informatics
Improving completeness and accuracy of 3D point clouds by using deep learning for applications of digital twins to civil structures