Self-Supervised Multi-View Stereo with Adaptive Depth Priors
Although supervised multi-view 3D reconstruction methods have recently achieved satisfactory performance, they suffer from major limitations such as the high cost of 3D data collection and poor generalization to unseen scenes. Hence, unsupervised 3D reconstruction approaches based on photometric consistency are being explored. However, variations in lighting conditions among different views and reflective surfaces within a scene can undermine the reliability of these approaches. In this paper, we propose adaptive depth priors as pseudo-labels to guide the optimization process of self-supervised multi-view stereo. First, sparse depth priors are generated by conventional structure-from-motion (SfM) and multi-view stereo (MVS) algorithms and then fed into a monocular depth estimation network to learn the adapted depth priors. In addition, a spatial-frequency fusion structure is designed to enhance global perception in the feature matching of MVS by combining local dependency from the spatial domain with global contextual information from the frequency domain. Extensive experiments on the DTU and Tanks & Temples datasets demonstrate that the proposed ADP-MVSNet achieves markedly improved results over existing unsupervised approaches and even outperforms some supervised methods.
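The photometric-consistency signal such self-supervised methods optimize can be sketched as a masked L1 error between the reference image and a source image warped into the reference view via a depth hypothesis. This is a minimal illustration, not the paper's implementation; the function name is hypothetical and the warping step itself is omitted.

```python
import numpy as np

def photometric_consistency_loss(ref_img, warped_src, valid_mask):
    """Masked L1 photometric error between the reference image and a
    source image warped into the reference view via a depth hypothesis.
    Pixels that warp outside the source frustum are excluded by the mask."""
    diff = np.abs(ref_img - warped_src)   # per-pixel intensity error
    masked = diff[valid_mask]             # ignore invalid warps
    return masked.mean() if masked.size else 0.0

# Toy example: a perfect warp (warped source identical to reference)
# yields zero loss, the fixed point the self-supervision drives toward.
ref = np.ones((4, 4))
warped = np.ones((4, 4))
mask = np.ones((4, 4), dtype=bool)
loss = photometric_consistency_loss(ref, warped, mask)
```

The mask is what keeps lighting changes and occlusions from dominating the loss, which is exactly the failure mode the depth priors above are meant to compensate for.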
- Conference Article
55
- 10.1109/iccv.2019.00114
- Oct 1, 2019
Highly accurate 3D volumetric reconstruction is still an open research topic, where the main difficulty is usually merging rough estimates with high-frequency details. One of the most promising directions is the fusion of multi-view stereo and photometric stereo images. Besides the intrinsic difficulties that multi-view stereo and photometric stereo each face in order to work reliably, additional problems arise when they are considered together. In this work, we present a volumetric approach to the multi-view photometric stereo problem. The key point of our method is the signed distance field parameterisation and its relation to the surface normal. This is exploited to obtain a linear partial differential equation, which is solved in a variational framework that combines multiple images from multiple points of view in a single system. In addition, the volumetric approach is naturally implemented on an octree, which allows for fast ray-tracing that reliably alleviates occlusions and cast shadows. Our approach is evaluated on synthetic and real datasets and achieves state-of-the-art results.
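The signed distance field parameterisation mentioned above relates the field to the surface normal in a standard way: a signed distance function $\phi$ satisfies the eikonal property, and its gradient gives the outward unit normal on the zero level set,

```latex
\|\nabla \phi(\mathbf{x})\| = 1,
\qquad
\mathbf{n}(\mathbf{x}) = \nabla \phi(\mathbf{x})
\quad \text{for } \mathbf{x} \in \{\phi = 0\}.
```

It is this linear dependence of the normal on $\nabla\phi$ that allows photometric normal constraints to be written as a linear PDE in $\phi$.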
- Conference Article
19
- 10.1109/wacv56688.2023.00314
- Jan 1, 2023
Multi-view photometric stereo (MVPS) is a preferred method for detailed and precise 3D acquisition of an object from images. Although popular methods for MVPS can provide outstanding results, they are often complex to execute and limited to isotropic material objects. To address such limitations, we present a simple, practical approach to MVPS that works well for isotropic as well as other object material types such as anisotropic and glossy. The proposed approach exploits the benefit of uncertainty modeling in a deep neural network for a reliable fusion of photometric stereo (PS) and multi-view stereo (MVS) network predictions. Yet, contrary to the recently proposed state-of-the-art, we introduce a neural volume rendering methodology for a trustworthy fusion of MVS and PS measurements. The advantage of introducing neural volume rendering is that it helps in the reliable modeling of objects with diverse material types, where existing MVS methods, PS methods, or both may fail. Furthermore, it allows us to work on neural 3D shape representation, which has recently shown outstanding results for many geometric processing tasks. Our suggested new loss function aims to fit the zero level set of the implicit neural function using the most certain MVS and PS network predictions coupled with a weighted neural volume rendering cost. The proposed approach shows state-of-the-art results when tested extensively on several benchmark datasets.
- Conference Article
850
- 10.1109/cvpr42600.2020.00257
- Jun 1, 2020
Deep multi-view stereo (MVS) and stereo matching approaches generally construct 3D cost volumes to regularize and regress the output depth or disparity. These methods are limited when high-resolution outputs are needed, since memory and time costs grow cubically as the volume resolution increases. In this paper, we propose a memory- and time-efficient cost volume formulation that is complementary to existing multi-view stereo and stereo matching approaches based on 3D cost volumes. First, the proposed cost volume is built upon a standard feature pyramid encoding geometry and context at gradually finer scales. Then, the depth (or disparity) range of each stage is narrowed using the depth (or disparity) map from the previous stage. With gradually higher cost volume resolution and adaptive adjustment of the depth (or disparity) intervals, the output is recovered in a coarse-to-fine manner. We apply the cascade cost volume to the representative MVSNet and obtain a 23.1% improvement on the DTU benchmark (1st place), with 50.6% and 74.2% reductions in GPU memory and run-time. It is also the state-of-the-art learning-based method on the Tanks and Temples benchmark. Statistics on accuracy, run-time, and GPU memory for other representative stereo CNNs further validate the effectiveness of the proposed method.
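The per-stage range narrowing described above can be sketched as follows; the function name and the shrink factor are illustrative, not taken from the paper.

```python
import numpy as np

def refine_depth_hypotheses(prev_depth, prev_interval, num_samples, shrink=0.5):
    """Build the next cascade stage's depth hypotheses: centre the search
    on the previous stage's depth map and shrink the sampling interval, so
    effective resolution grows while the number of sampled planes (and thus
    the cost volume size) stays small.
    prev_depth: (H, W) depth map; returns (num_samples, H, W) hypotheses."""
    interval = prev_interval * shrink                       # narrower spacing
    offsets = (np.arange(num_samples) - num_samples / 2) * interval
    return prev_depth[None] + offsets[:, None, None]        # broadcast per pixel

# Toy example: refine around a flat 5.0 m prediction with 4 planes.
depth = np.full((2, 2), 5.0)
hyps = refine_depth_hypotheses(depth, prev_interval=1.0, num_samples=4)
```

Each stage therefore trades global range coverage for local precision, which is why the cubic memory growth of a single full-resolution volume is avoided.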
- Conference Article
30
- 10.1109/cvpr52688.2022.01227
- Jun 1, 2022
This paper presents a simple and effective solution to the longstanding classical multi-view photometric stereo (MVPS) problem. It is well-known that photometric stereo (PS) is excellent at recovering high-frequency surface details, whereas multi-view stereo (MVS) can help remove the low-frequency distortion due to PS and retain the global geometry of the shape. This paper proposes an approach that can effectively utilize such complementary strengths of PS and MVS. Our key idea is to combine them suitably while considering the per-pixel uncertainty of their estimates. To this end, we estimate per-pixel surface normals and depth using an uncertainty-aware deep-PS network and deep-MVS network, respectively. Uncertainty modeling helps select reliable surface normal and depth estimates at each pixel which then act as a true representative of the dense surface geometry. At each pixel, our approach either selects or discards deep-PS and deep-MVS network prediction depending on the prediction uncertainty measure. For dense, detailed, and precise inference of the object's surface profile, we propose to learn the implicit neural shape representation via a multilayer perceptron (MLP). Our approach encourages the MLP to converge to a natural zero-level set surface using the confident prediction from deep-PS and deep-MVS networks, providing superior dense surface reconstruction. Extensive experiments on the DiLiGenT-MV benchmark dataset show that our method provides high-quality shape recovery with a much lower memory footprint while outperforming almost all of the existing approaches.
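The per-pixel select-or-discard step can be sketched as below. This is a simplified stand-in for the paper's uncertainty-aware fusion; the function name, array layout, and threshold are assumptions.

```python
import numpy as np

def select_by_uncertainty(ps_depth, ps_sigma, mvs_depth, mvs_sigma, max_sigma):
    """Per-pixel fusion in the spirit of the scheme above: keep whichever
    network's estimate is more certain, and mark pixels where both
    uncertainties exceed a threshold as unreliable (NaN).
    All inputs are (H, W) arrays."""
    fused = np.where(ps_sigma < mvs_sigma, ps_depth, mvs_depth)  # pick lower sigma
    reliable = np.minimum(ps_sigma, mvs_sigma) < max_sigma       # confidence gate
    return np.where(reliable, fused, np.nan)

# Pixel 0: PS is confident, so its depth wins; pixel 1: both are
# uncertain, so the pixel is discarded rather than trusted.
ps_d = np.array([[1.0, 2.0]]); ps_s = np.array([[0.1, 0.9]])
mv_d = np.array([[1.5, 2.5]]); mv_s = np.array([[0.3, 0.8]])
fused = select_by_uncertainty(ps_d, ps_s, mv_d, mv_s, max_sigma=0.5)
```

Only the surviving (non-NaN) pixels would then supervise the MLP's zero level set, which is what keeps unreliable PS or MVS predictions from distorting the implicit surface.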
- Conference Article
1
- 10.1109/3dtv.2009.5069646
- May 1, 2009
Multi-view stereo image composition mainly depends on the type of multi-view stereo display device. Currently, multi-view LCD autostereoscopic displays based on an optical plate are common, while composition methods for them remain limited. This paper proposes a new, general multi-view LCD stereo image composition method for optical plate LCD stereo display devices. The proposed method consists of three steps: sub-pixel judgment, sub-sampling of the sub-pixels of each view, and arrangement and composition of the sub-pixels. It covers all possible configurations of the optical plate LCD stereo display device and thus has good universality and applicability. The feasibility of the proposed method is verified on a concrete stereo display device.
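The arrangement step can be illustrated with the simplest case, a vertical optical plate where consecutive RGB sub-pixel columns cycle through the views. This sketch is a generic illustration, not the paper's method; real devices add a per-row slant or offset to the view assignment.

```python
import numpy as np

def interleave_subpixels(views):
    """Illustrative sub-pixel arrangement for an N-view display with a
    vertical optical plate: each RGB sub-pixel column is taken from a
    different view in turn.
    views: (N, H, W, 3) array of view images; returns one (H, W, 3) frame."""
    n, h, w, _ = views.shape
    out = np.empty((h, w, 3), dtype=views.dtype)
    for x in range(w):
        for c in range(3):                              # R, G, B sub-pixels
            out[:, x, c] = views[(3 * x + c) % n, :, x, c]
    return out

# Four constant-valued views make the interleaving pattern visible:
# each sub-pixel's value reveals which view it was sampled from.
views = np.stack([np.full((2, 2, 3), float(i)) for i in range(4)])
frame = interleave_subpixels(views)
```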
- Conference Article
54
- 10.1109/iccv.2013.148
- Dec 1, 2013
We propose a method for accurate 3D shape reconstruction using uncalibrated multiview photometric stereo. A coarse mesh reconstructed using multiview stereo is first parameterized using a planar mesh parameterization technique. Subsequently, multiview photometric stereo is performed in the 2D parameter domain of the mesh, where all geometric and photometric cues from multiple images can be treated uniformly. Unlike traditional methods, there is no need for merging view-dependent surface normal maps. Our key contribution is a new photometric stereo based mesh refinement technique that can efficiently reconstruct meshes with extremely fine geometric details by directly estimating a displacement texture map in the 2D parameter domain. We demonstrate that intricate surface geometry can be reconstructed using several challenging datasets containing surfaces with specular reflections, multiple albedos and complex topologies.
- Conference Article
3
- 10.1109/ssiai.2010.5483892
- Jan 1, 2010
We explore the use of Distributed Ray Tracing (DRT), an anti-aliasing technique from computer graphics, in multi-view computational stereo. As an example, we study ABM, a multi-view stereo algorithm based on a set of Hough transform accumulation operations. Augmenting ABM with DRT improves both internal signal quality and reconstruction accuracy. Results are given for both fundamental and complex “super-resolution reconstruction” tasks, where the voxel side length is less than the image ground sample distance. DRT improves ABM accuracy by 18% and can be generalized to improve other stereo algorithms.
- Research Article
2
- 10.5194/isprs-archives-xlviii-1-w2-2023-1075-2023
- Dec 13, 2023
- The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences
In this paper, we propose a method for performing 3D reconstruction by generating virtual RPC parameters from multi-view satellite stereo images provided by Google Earth (GE) software. In the general multi-view stereo (MVS) case, once the poses and parameters of the camera have been estimated, a dense 3D surface can be reconstructed. For satellite images, however, it is not easy to obtain the original images of an area of interest together with their pose parameters. In the case of GE software, which can provide images across the globe, the delivered images are georeferenced and modified to fit ground control points (GCPs), so there is no camera model that explains the projection relationship. The purpose of the proposed method is therefore to perform 3D reconstruction by generating virtual camera parameters for the modified satellite images obtained from GE software. In the proposed method, the satellite images obtained from GE are treated as pinhole images and processed with structure from motion (SfM) for an initial reconstruction. After the initial reconstruction, the 3D model is transformed from the distorted hexahedral space formed along the pixel rays to a metric UTM coordinate space through 3D homography-based georeferencing. Virtual rational polynomial camera (RPC) parameters are then calculated from the satellite images and the corresponding 3D points in UTM coordinates. The final result is generated with the virtual RPC and an MVS method that uses the RPC model. The DSM reconstructed using the virtual RPC improves on the initial reconstruction of the proposed pipeline, and error measurement against ground truth (GT) in the area yields a mean absolute error (MAE) of 1.366 m.
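For reference, an RPC model of the kind the method synthesises maps normalised ground coordinates $(X, Y, Z)$ to normalised image row and column coordinates as ratios of polynomials:

```latex
r = \frac{P_1(X, Y, Z)}{P_2(X, Y, Z)},
\qquad
c = \frac{P_3(X, Y, Z)}{P_4(X, Y, Z)},
```

where each $P_i$ is a third-order polynomial with 20 coefficients. Fitting these coefficients to the image-to-UTM correspondences recovered above is what yields the "virtual" RPC.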
- Research Article
4
- 10.1109/access.2023.3273903
- Jan 1, 2023
- IEEE Access
Extensive studies have been conducted on multi-view stereo and stereo matching for 3D reconstruction, whereas relatively few methods have been proposed for large-scale environments. The difficulty of producing high-resolution depth/disparity maps is one of the main reasons. In this paper, we propose a dual attention-guided self-adaptive aware cascade network (DAscNet) that achieves state-of-the-art results in generating high-resolution depth/disparity maps of complex scenes by introducing a cascade inference strategy over a set of input views. A pyramid cost volume fusion and a self-adaptive cost volume cascade are built upon dual attention-guided context multi-scale feature extraction, encoding geometric, spatial, and contextual information at gradually finer scales to achieve a robust structural representation for prediction. The dual attention-guided context multi-scale feature extraction is made up of two distinct modules, both based on the attention mechanism. In the pyramid cost volume fusion, an inter-cost attention aggregation module fuses multiple low-resolution dense cost volumes to achieve a robust structural representation for the initial predictions. In the self-adaptive cost volume cascade, a changeable depth/disparity range estimation module alters the depth/disparity search-range interval of the following stage based on the prediction from the previous stage. This module drives the network to gradually resolve complicated matching ambiguities and improves the accuracy of the predicted search-range interval. Experiments on two publicly available datasets, the Tanks and Temples dataset and the DTU dataset, show that DAscNet outperforms prior work. The effectiveness of our proposed method is also supported by statistics on the accuracy, runtime, and GPU memory of other representative methods.
- Book Chapter
2
- 10.1007/978-3-642-38267-3_20
- Jan 1, 2013
Surface reconstruction using patch-based multi-view stereo commonly assumes that the underlying surface is locally planar. This is typically not true, so that least-squares fitting of a planar patch leads to systematic errors which are of particular importance for multi-scale surface reconstruction. In a recent paper [12], we determined the modulation transfer function of a classical patch-based stereo system. Our key insight was that the reconstructed surface is a box-filtered version of the original surface. Since the box filter is not a true low-pass filter, this causes high-frequency artifacts. In this paper, we propose an extended reconstruction model by weighting the least-squares fit of the 3D patch. We show that if the weighting function meets specified criteria, the reconstructed surface is the convolution of the original surface with that weighting function. A choice of particular interest is the Gaussian, which is commonly used in image and signal processing but left unexploited by many multi-view stereo algorithms. Finally, we demonstrate the effects of our theoretic findings using experiments on synthetic and real-world data sets.
Keywords: multi-view stereo, multi-scale surface reconstruction
- Conference Article
31
- 10.1109/wacv51458.2022.00402
- Jan 1, 2022
We present a modern solution to the multi-view photometric stereo (MVPS) problem. Our work suitably exploits the image formation model in an MVPS experimental setup to recover the dense 3D reconstruction of an object from images. We procure the surface orientation using a photometric stereo (PS) image formation model and blend it with a multi-view neural radiance field representation to recover the object's surface geometry. Contrary to previous multi-stage frameworks for MVPS, where the position, iso-depth contours, or orientation measurements are estimated independently and then fused later, our method is simple to implement and realize. Our method performs neural rendering of multi-view images while utilizing surface normals estimated by a deep photometric stereo network. We render the MVPS images by considering the object's surface normals for each 3D sample point along the viewing direction rather than explicitly using the density gradient in the volume space via 3D occupancy information. We optimize the proposed neural radiance field representation for the MVPS setup efficiently using a fully connected deep network to recover the 3D geometry of an object. Extensive evaluation on the DiLiGenT-MV benchmark dataset shows that our method performs better than approaches that perform only PS or only multi-view stereo (MVS) and provides comparable results against the state-of-the-art multi-stage fusion methods.
- Research Article
2
- 10.1109/access.2020.3004431
- Jan 1, 2020
- IEEE Access
Image-based rendering (IBR) attempts to synthesize novel views using a set of observed images. Some IBR approaches (such as light fields) have yielded impressive high-quality results on small-scale scenes with dense photo capture. However, available wide-baseline IBR methods are still restricted by the low geometric accuracy and completeness of multi-view stereo (MVS) reconstruction on low-textured and non-Lambertian surfaces. The issues become more significant in large-scale outdoor scenes due to challenging scene content, e.g., buildings, trees, and sky. To address these problems, we present a novel IBR algorithm that consists of two key components. First, we propose a novel depth refinement method that combines MVS depth maps with monocular depth maps predicted via deep learning. A lookup table remap is proposed for converting the scale of the monocular depths to be consistent with the scale of the MVS depths. Then, the rescaled monocular depth is used as the constraint in the minimum spanning tree (MST)-based nonlocal filter to refine the per-view MVS depth. Second, we present an efficient shape-preserving warping algorithm that uses superpixels to generate the warped images and blend expected novel views of scenes. The proposed method has been evaluated on public MVS and view synthesis datasets, as well as newly captured large-scale outdoor datasets. In comparison with state-of-the-art methods, the experimental results demonstrated that the proposed method can obtain more complete and reliable depth maps for the challenging large-scale outdoor scenes, thereby resulting in more promising novel view synthesis.
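The lookup-table remap above aligns the monocular depth's scale with the metric MVS depth. A quantile-matching version of that idea can be sketched as follows; this is an illustrative stand-in, and the function name and bin count are assumptions, not the paper's exact table construction.

```python
import numpy as np

def remap_monocular_scale(mono_depth, mvs_depth, mvs_valid, num_bins=64):
    """Align the monocular depth's distribution to the sparse but metric
    MVS depth by mapping matching quantiles, so both depth sources share
    one scale before the MST-based refinement.
    mono_depth: (H, W); mvs_depth: (H, W); mvs_valid: (H, W) bool mask."""
    qs = np.linspace(0.0, 1.0, num_bins)
    mono_q = np.quantile(mono_depth, qs)           # lookup-table keys
    mvs_q = np.quantile(mvs_depth[mvs_valid], qs)  # lookup-table values
    return np.interp(mono_depth, mono_q, mvs_q)    # per-pixel remap

# Relative monocular depths in [0, 1] are remapped onto the metric
# 10-30 m range observed by MVS at the valid pixels.
mono = np.array([[0.0, 1.0], [0.5, 1.0]])
mvs = np.array([[10.0, 30.0], [20.0, 30.0]])
valid = np.ones((2, 2), dtype=bool)
metric = remap_monocular_scale(mono, mvs, valid)
```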
- Conference Article
91
- 10.1109/cvpr52688.2022.01265
- Jun 1, 2022
Multi-view Stereo (MVS) with known camera parameters is essentially a 1D search problem within a valid depth range. Recent deep learning-based MVS methods typically sample depth hypotheses densely in this range and then construct prohibitively memory-consuming 3D cost volumes for depth prediction. Although coarse-to-fine sampling strategies alleviate this overhead to a certain extent, the efficiency of MVS remains an open challenge. In this work, we propose a novel method for highly efficient MVS that remarkably decreases the memory footprint while clearly advancing state-of-the-art depth prediction performance. We investigate what search strategy is reasonably optimal for MVS, taking into account both efficiency and effectiveness. We first formulate MVS as a binary search problem and accordingly propose a generalized binary search network for MVS. Specifically, in each step, the depth range is split into 2 bins, with one extra error-tolerance bin on each side. A classification is performed to identify which bin contains the true depth. We also design three mechanisms to handle classification errors, deal with out-of-range samples, and decrease the training memory, respectively. The new formulation lets our method sample only a very small number of depth hypotheses in each step, which is highly memory efficient and also greatly facilitates quick training convergence. Experiments on competitive benchmarks show that our method achieves state-of-the-art accuracy with much less memory. In particular, our method obtains an overall score of 0.289 on the DTU dataset and takes first place on the challenging Tanks and Temples advanced dataset among all learning-based methods. Our code will be released at https://github.com/MiZhenxing/GBi-Net.
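The core search the formulation above describes can be sketched without the network: at each step a classifier (here an oracle standing in for the learned per-pixel classification) picks which of the 2 bins contains the true depth. The error-tolerance bins and the three auxiliary mechanisms are omitted for brevity.

```python
def binary_search_depth(d_min, d_max, classify, num_steps):
    """Sketch of generalized binary search over a depth range: each step
    splits the current range into 2 bins, `classify` picks the bin holding
    the true depth, and the chosen bin becomes the next range.
    `classify(lo, mid, hi)` returns 0 (lower bin) or 1 (upper bin)."""
    lo, hi = d_min, d_max
    for _ in range(num_steps):
        mid = 0.5 * (lo + hi)
        if classify(lo, mid, hi) == 0:
            hi = mid                      # true depth in the lower bin
        else:
            lo = mid                      # true depth in the upper bin
    return 0.5 * (lo + hi)                # centre of the final bin

# Toy oracle that knows the true depth: 10 steps narrow a 1024-unit range
# to a 1-unit bin, i.e. the resolution of 1024 dense hypotheses from only
# 10 binary decisions.
true_depth = 437.0
oracle = lambda lo, mid, hi: 0 if true_depth < mid else 1
est = binary_search_depth(0.0, 1024.0, oracle, num_steps=10)
```

This logarithmic-versus-linear hypothesis count is the source of the memory savings the abstract reports.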
- Research Article
2
- 10.1088/1755-1315/1486/1/012020
- Apr 1, 2025
- IOP Conference Series: Earth and Environmental Science
This study examines 3D modeling techniques, emphasizing the advantages of Neural Radiance Field (NeRF) over Multiview Stereo (MVS) in reconstructing accurate models. By employing Principal Component Analysis (PCA), we compare point clouds from both methods to evaluate their quality and distinguish true representations from noise artifacts in practical applications. This approach allows for a detailed assessment of reconstruction quality, highlighting how various factors such as lighting conditions, surface features, and material properties impact the accuracy and density of the resulting 3D models. While NeRF sometimes exhibits a higher point density, MVS demonstrates superior performance, particularly when dealing with homogeneous textures, yielding denser point clouds and more accurate representations. The analysis shows that MVS excels in data density, feature extraction, and noise reduction, resulting in consistently cleaner models. In contrast, NeRF, despite its high data density, is adversely affected by significant noise and outliers, which obscure object details. Both methods achieve satisfactory levels of object completeness; however, MVS outperforms NeRF in detail sharpness, surface smoothness, and overall clarity. This comparison underscores the critical influence of texture and surface characteristics on the effectiveness of 3D reconstruction techniques, affirming MVS’s advantages in producing reliable and accurate representations of 3D objects.
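The PCA comparison described above rests on a standard computation: eigen-decomposing the covariance of the centred point cloud, whose eigenvalue spread separates planar or linear structure from isotropic noise. A minimal sketch (function name is illustrative):

```python
import numpy as np

def principal_axes(points):
    """PCA of a point cloud: eigen-decompose the covariance of the
    centred points. A near-zero smallest eigenvalue indicates a clean
    planar patch; three comparable eigenvalues indicate isotropic noise.
    points: (N, 3) array; returns eigenvalues (descending) and axes."""
    centred = points - points.mean(axis=0)
    cov = centred.T @ centred / (len(points) - 1)
    evals, evecs = np.linalg.eigh(cov)        # eigh returns ascending order
    order = np.argsort(evals)[::-1]
    return evals[order], evecs[:, order]

# A noise-free planar patch: the smallest eigenvalue vanishes, which is
# the signature the study uses to tell true surface from noise artifacts.
rng = np.random.default_rng(0)
pts = np.column_stack([rng.uniform(-1, 1, 500),
                       rng.uniform(-1, 1, 500),
                       np.zeros(500)])
evals, axes = principal_axes(pts)
```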
- Conference Article
14
- 10.1109/icra40945.2020.9197089
- May 1, 2020
Multi-view stereo (MVS) algorithms have been commonly used to model large-scale structures. When processing MVS, image acquisition is an important issue because reconstruction quality depends heavily on the acquired images. Recently, an explore-then-exploit strategy has been used to acquire images for MVS. This method first constructs a coarse model by exploring the entire scene using a pre-allocated camera trajectory, and then rescans the unreconstructed regions identified from the coarse model. However, this strategy is inefficient because of the frequent overlap of the initial and rescanning trajectories. Furthermore, even given complete image coverage, MVS algorithms do not guarantee an accurate reconstruction result. In this study, we propose a novel view path-planning method based on an online MVS system. This method aims to incrementally construct the target three-dimensional (3D) model in real time. View paths are continually planned based on online feedback from the partially constructed model. The obtained paths fully cover low-quality surfaces while maximizing the reconstruction performance of MVS. Experimental results demonstrate that the proposed method can construct high-quality 3D models with a single exploration trial, without any rescanning trial as in the explore-then-exploit method.