Fine-Detailed Neural Indoor Scene Reconstruction Using Multi-Level Importance Sampling And Multi-View Consistency
Recently, neural implicit 3D reconstruction in indoor scenarios has become popular due to its simplicity and impressive performance. Previous works can produce complete results by leveraging monocular normal or depth priors, but they may suffer from over-smoothed reconstructions and lengthy optimization due to unbiased sampling and inaccurate monocular priors. In this paper, we propose a novel neural implicit surface reconstruction method, named FD-NeuS, to learn fine-detailed 3D models using a multi-level importance sampling strategy and multi-view consistency. Specifically, we leverage segmentation priors to guide region-based ray sampling and use piecewise exponential functions as weights to guide 3D point sampling along the rays, ensuring more attention on important regions. In addition, we introduce multi-view feature consistency and multi-view normal consistency as supervision and uncertainty respectively, which further improve the reconstruction of details. Extensive quantitative and qualitative results show that FD-NeuS outperforms existing methods in various scenes.
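The abstract does not give the exact form of the piecewise exponential weights; the sketch below only illustrates the general mechanism of weight-guided inverse-CDF point sampling along a ray. The weight function, its sharpness, and the assumed surface depth `t_surf` are hypothetical stand-ins, not FD-NeuS's actual formulation:

```python
import numpy as np

def sample_points_along_ray(t_near, t_far, n_coarse=64, n_fine=64, rng=None):
    """Importance-sample depths along a ray using piecewise exponential
    weights (hypothetical form; the paper's exact weighting differs)."""
    rng = np.random.default_rng() if rng is None else rng
    # Coarse, uniform bins over the ray segment.
    edges = np.linspace(t_near, t_far, n_coarse + 1)
    mids = 0.5 * (edges[:-1] + edges[1:])
    # Hypothetical piecewise-exponential weight peaked near an assumed
    # surface depth (in practice this would come from the SDF field).
    t_surf = 0.5 * (t_near + t_far)
    weights = np.exp(-4.0 * np.abs(mids - t_surf))
    pdf = weights / weights.sum()
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])
    # Inverse-CDF sampling: bins with larger weights receive more samples.
    u = rng.random(n_fine)
    idx = np.clip(np.searchsorted(cdf, u, side="right") - 1, 0, n_coarse - 1)
    # Linear interpolation within each selected bin.
    denom = np.where(pdf[idx] > 0, pdf[idx], 1.0)
    frac = (u - cdf[idx]) / denom
    return np.sort(edges[idx] + frac * (edges[idx + 1] - edges[idx]))

print(sample_points_along_ray(0.1, 4.0)[:8])
```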
- Research Article
1
- 10.1109/tpami.2025.3607103
- Sep 8, 2025
- IEEE transactions on pattern analysis and machine intelligence
Radiance fields represented by 3D Gaussians excel at synthesizing novel views, offering both high training efficiency and fast rendering. However, with sparse input views, the lack of multi-view consistency constraints results in poorly initialized Gaussians and unreliable heuristics for optimization, leading to suboptimal performance. Existing methods often incorporate depth priors from dense estimation networks but overlook the inherent multi-view consistency in input images. Additionally, they rely on dense initialization, which limits the efficiency of scene representation. To overcome these challenges, we propose a view synthesis framework based on 3D Gaussian Splatting, named MCGS, enabling photorealistic scene reconstruction from sparse views. The key innovations of MCGS in enhancing multi-view consistency are as follows: i) We leverage matching priors from a sparse matcher to initialize Gaussians primarily on textured regions, while low-texture areas are populated with randomly distributed Gaussians. This yields a compact yet sufficient set of initial Gaussians. ii) We propose a multi-view consistency-guided progressive pruning strategy to dynamically eliminate inconsistent Gaussians. This approach confines their optimization to a consistency-constrained space, which ensures robust and coherent scene reconstruction. These strategies enhance robustness to sparse views, accelerate rendering, and reduce memory consumption, making MCGS a practical framework for 3D Gaussian Splatting.
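As a rough illustration of the progressive pruning idea, the sketch below drops Gaussians whose multi-view consistency score falls below a linearly tightening threshold. The score definition, schedule, and all names are assumptions, not MCGS's actual implementation:

```python
import numpy as np

def progressive_prune(positions, opacities, consistency, step, max_steps,
                      tau_start=0.1, tau_end=0.5):
    """Hypothetical consistency-guided progressive pruning: discard
    Gaussians whose score falls below a threshold that tightens
    linearly over training (MCGS's schedule and score may differ)."""
    tau = tau_start + (tau_end - tau_start) * (step / max_steps)
    keep = consistency >= tau
    return positions[keep], opacities[keep], consistency[keep]

rng = np.random.default_rng(0)
pos = rng.normal(size=(1000, 3))   # Gaussian centers
opa = rng.random(1000)             # opacities
con = rng.random(1000)             # per-Gaussian consistency scores
for step in (0, 500, 1000):
    p, o, c = progressive_prune(pos, opa, con, step, 1000)
    print(step, len(p))
```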
- Conference Article
151
- 10.1109/cvpr52688.2022.00543
- Jun 1, 2022
This paper addresses the challenge of reconstructing 3D indoor scenes from multi-view images. Many previous works have shown impressive reconstruction results on textured objects, but they still have difficulty in handling low-textured planar regions, which are common in indoor scenes. An approach to solving this issue is to incorporate planar constraints into the depth map estimation in multi-view stereo-based methods, but the per-view plane estimation and depth optimization lack both efficiency and multi-view consistency. In this work, we show that the planar constraints can be conveniently integrated into the recent implicit neural representation-based reconstruction methods. Specifically, we use an MLP network to represent the signed distance function as the scene geometry. Based on the Manhattan-world assumption, planar constraints are employed to regularize the geometry in floor and wall regions predicted by a 2D semantic segmentation network. To resolve the inaccurate segmentation, we encode the semantics of 3D points with another MLP and design a novel loss that jointly optimizes the scene geometry and semantics in 3D space. Experiments on ScanNet and 7-Scenes datasets show that the proposed method outperforms previous methods by a large margin on 3D reconstruction quality. The code and supplementary materials are available at https://zju3dv.github.io/manhattan_sdf.
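The Manhattan-world regularization lends itself to a compact sketch: floor normals are pushed toward the up axis, and wall normals toward the horizontal plane. The label ids and equal loss weighting below are assumptions; the paper additionally aligns walls with horizontal directions and jointly optimizes semantics:

```python
import numpy as np

UP = np.array([0.0, 0.0, 1.0])

def manhattan_normal_loss(normals, sem):
    """Sketch of Manhattan-world regularization: floor normals should
    align with the up axis, wall normals should be orthogonal to it.
    Label ids (1 = floor, 2 = wall) are hypothetical."""
    n = normals / np.linalg.norm(normals, axis=-1, keepdims=True)
    cos_up = n @ UP
    floor, wall = sem == 1, sem == 2
    loss_floor = np.mean(1.0 - np.abs(cos_up[floor])) if floor.any() else 0.0
    loss_wall = np.mean(np.abs(cos_up[wall])) if wall.any() else 0.0
    return loss_floor + loss_wall

normals = np.random.randn(100, 3)          # predicted surface normals
sem = np.random.randint(0, 3, size=100)    # per-point semantic labels
print(manhattan_normal_loss(normals, sem))
```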
- Research Article
6
- 10.1109/tpami.2024.3379833
- Sep 1, 2024
- IEEE transactions on pattern analysis and machine intelligence
This paper addresses the challenge of reconstructing 3D indoor scenes from multi-view images. Many previous works have shown impressive reconstruction results on textured objects, but they still have difficulty in handling low-textured planar regions, which are common in indoor scenes. An approach to solving this issue is to incorporate planar constraints into the depth map estimation in multi-view stereo-based methods, but the per-view plane estimation and depth optimization lack both efficiency and multi-view consistency. In this work, we show that the planar constraints can be conveniently integrated into the recent implicit neural representation-based reconstruction methods. Specifically, we use an MLP network to represent the signed distance function as the scene geometry. Based on the Manhattan-world assumption and the Atlanta-world assumption, planar constraints are employed to regularize the geometry in floor and wall regions predicted by a 2D semantic segmentation network. To resolve the inaccurate segmentation, we encode the semantics of 3D points with another MLP and design a novel loss that jointly optimizes the scene geometry and semantics in 3D space. Experiments on ScanNet and 7-Scenes datasets show that the proposed method outperforms previous methods by a large margin on 3D reconstruction quality.
- Conference Article
101
- 10.1109/iccv48922.2021.01404
- Oct 1, 2021
We tackle the challenge of learning a distribution over complex, realistic, indoor scenes. In this paper, we introduce Generative Scene Networks (GSN), which learns to decompose scenes into a collection of many local radiance fields that can be rendered from a free moving camera. Our model can be used as a prior to generate new scenes, or to complete a scene given only sparse 2D observations. Recent work has shown that generative models of radiance fields can capture properties such as multi-view consistency and view-dependent lighting. However, these models are specialized for constrained viewing of single objects, such as cars or faces. Due to the size and complexity of realistic indoor environments, existing models lack the representational capacity to adequately capture them. Our decomposition scheme scales to larger and more complex scenes while preserving details and diversity, and the learned prior enables high-quality rendering from viewpoints that are significantly different from observed viewpoints. When compared to existing models, GSN produces quantitatively higher-quality scene renderings across several different scene datasets.
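The decomposition into many local radiance fields can be pictured as a 2D grid of latent codes spread over the floorplan, with each 3D sample point fetching its conditioning code by bilinear interpolation. The grid shape, extent handling, and names below are illustrative assumptions, not GSN's architecture:

```python
import numpy as np

def local_code(grid, x, z, extent):
    """Fetch the local conditioning code for a 3D point by bilinear
    interpolation over a 2D floorplan grid of latent codes (a sketch
    of the decomposition idea; GSN's decoder is not shown)."""
    h, w, _ = grid.shape
    # Map world (x, z) into continuous grid coordinates.
    u = (x / extent + 0.5) * (w - 1)
    v = (z / extent + 0.5) * (h - 1)
    u0 = int(np.clip(np.floor(u), 0, w - 2))
    v0 = int(np.clip(np.floor(v), 0, h - 2))
    du, dv = u - u0, v - v0
    return ((1-du)*(1-dv)*grid[v0, u0] + du*(1-dv)*grid[v0, u0+1]
            + (1-du)*dv*grid[v0+1, u0] + du*dv*grid[v0+1, u0+1])

grid = np.random.randn(16, 16, 32)   # 16x16 local codes, 32-dim each
print(local_code(grid, 0.3, -1.2, extent=8.0)[:4])
```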
- Research Article
1
- 10.62762/tis.2025.874668
- Aug 25, 2025
- ICCK Transactions on Intelligent Systematics
This paper proposes a Diffusion Model-Optimized Neural Radiance Field (DT-NeRF) method, aimed at enhancing detail recovery and multi-view consistency in 3D scene reconstruction. By combining diffusion models with Transformers, DT-NeRF effectively restores details under sparse viewpoints and maintains high accuracy in complex geometric scenes. Experimental results demonstrate that DT-NeRF significantly outperforms traditional NeRF and other state-of-the-art methods on the Matterport3D and ShapeNet datasets, particularly in metrics such as PSNR, SSIM, Chamfer Distance, and Fidelity. Ablation experiments further confirm the critical role of the diffusion and Transformer modules in the model's performance, with the removal of either module leading to a decline in performance. The design of DT-NeRF showcases the synergistic effect between modules, providing an efficient and accurate solution for 3D scene reconstruction. Future research may focus on further optimizing the model, exploring more advanced generative models and network architectures to enhance its performance in large-scale dynamic scenes.
- Conference Article
2
- 10.1145/3095140.3095167
- Jun 27, 2017
- Research Article
31
- 10.1016/j.isprsjprs.2022.02.014
- Mar 3, 2022
- ISPRS Journal of Photogrammetry and Remote Sensing
GeoRec: Geometry-enhanced semantic 3D reconstruction of RGB-D indoor scenes
- Research Article
1
- 10.1111/cgf.14657
- Oct 1, 2022
- Computer Graphics Forum
With the rapid development of data-driven techniques, data has played an essential role in various computer vision tasks. Many realistic and synthetic datasets have been proposed to address different problems. However, several challenges remain unresolved: (1) creating a dataset is usually a tedious process involving manual annotation, (2) most datasets are designed for only a single specific task, (3) modifying or randomizing the 3D scene is difficult, and (4) releasing commercial 3D data may raise copyright issues. This paper presents MINERVAS, a Massive INterior EnviRonments VirtuAl Synthesis system, to facilitate 3D scene modification and 2D image synthesis for various vision tasks. In particular, we design a programmable pipeline with a Domain-Specific Language, allowing users to select scenes from a commercial indoor scene database, synthesize scenes for different tasks with customized rules, and render various types of imagery data, such as color images, geometric structures, and semantic labels. Our system eases the difficulty of customizing massive scenes for different tasks and relieves users from manipulating fine-grained scene configurations by providing user-controllable randomness using multilevel samplers. Most importantly, it empowers users to access commercial scene databases with millions of indoor scenes while protecting the copyright of core data assets, e.g., 3D CAD models. We demonstrate the validity and flexibility of our system by using our synthesized data to improve the performance on different kinds of computer vision tasks. The project page is at https://coohom.github.io/MINERVAS.
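To give a flavor of what a programmable synthesis pipeline with user-controllable, multilevel randomness might look like, here is a purely hypothetical Python sketch; MINERVAS exposes an actual Domain-Specific Language, and none of the names below come from its API:

```python
# Hypothetical sketch of a MINERVAS-style programmable pipeline.
import random

def pipeline(scene_db, n_scenes, seed=0):
    rng = random.Random(seed)                # user-controllable randomness
    scenes = rng.sample(scene_db, n_scenes)  # scene-level sampler
    outputs = []
    for scene in scenes:
        # Entity-level randomization rule (illustrative).
        scene = dict(scene, wall_color=rng.choice(["white", "beige"]))
        outputs.append({"scene": scene["id"],
                        "renders": ["color", "depth", "semantic"]})
    return outputs

db = [{"id": f"scene_{i}"} for i in range(100)]
print(pipeline(db, 2))
```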
- Conference Article
5
- 10.1109/icvr51878.2021.9483856
- May 20, 2021
The 3D reconstruction of weak-feature indoor scenes is a challenging task that cannot be effectively solved by existing methods based on image features. In this paper, we propose an indoor scene reconstruction method based on Hector SLAM [1] and floorplan optimization to generate a standard and realistic 3D mesh model. First, we use Hector SLAM [1] and lidar to generate a 2D grid map and extract the edge points as the initial floorplan by filtering out discrete noise. Second, we classify these edge points into different classes through a region-growing algorithm and fit each class with a line. Then we optimize these lines according to the topological regularity of man-made scenes to obtain a standard floorplan. Finally, we combine the optimized floorplan and texture images to generate a realistic 3D mesh model of the weak-feature indoor scene. We evaluate our approach on four weak-feature scenes and demonstrate its advantages over existing alternative methods.
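The line-optimization step can be illustrated by snapping fitted wall-line angles to a dominant direction or its perpendicular, reflecting the topological regularity of man-made scenes. The tolerance and the choice of the first line as the reference frame are assumptions, not the paper's exact procedure:

```python
import numpy as np

def snap_lines(angles, tol=np.deg2rad(10)):
    """Snap fitted line angles to the dominant direction or its
    perpendicular when within tolerance (simplified regularization)."""
    dominant = angles[0]            # assume the first line sets the frame
    snapped = []
    for a in angles:
        # Angular distance to the dominant frame, modulo 90 degrees.
        d = (a - dominant) % (np.pi / 2)
        d = min(d, np.pi / 2 - d)
        if d < tol:
            k = round((a - dominant) / (np.pi / 2))
            snapped.append(dominant + k * np.pi / 2)
        else:
            snapped.append(a)       # leave genuinely oblique walls alone
    return np.array(snapped)

print(np.rad2deg(snap_lines(np.deg2rad([0.0, 92.0, 44.0, 178.0]))))
# -> [0. 90. 44. 180.]
```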
- Conference Article
27
- 10.1109/iccv.2013.348
- Dec 1, 2013
Updating a global 3D model with live RGB-D measurements has proven to be successful for 3D reconstruction of indoor scenes. Recently, a Truncated Signed Distance Function (TSDF) volumetric model and a fusion algorithm were introduced (KinectFusion), showing significant advantages such as computational speed and accuracy of the reconstructed scene. This algorithm, however, is expensive in memory when constructing and updating the global model. As a consequence, the method does not scale well to large scenes. We propose a new flexible 3D scene representation using a set of planes that is cheap in memory use and, nevertheless, achieves accurate reconstruction of indoor scenes from RGB-D image sequences. Projecting the scene onto different planes significantly reduces the size of the scene representation, allowing us to generate a global textured 3D model with lower memory requirements while maintaining accuracy and ease of updating with live RGB-D measurements. Experimental results demonstrate that our proposed flexible 3D scene representation achieves accurate reconstruction while remaining scalable to large indoor scenes.
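The underlying TSDF fusion update that KinectFusion popularized, and that this paper makes memory-efficient by storing per-plane 2D projections instead of a dense volume, is the standard weighted running average sketched below (grid size, weights, and truncation here are illustrative):

```python
import numpy as np

def update_tsdf(D, W, d_new, w_new=1.0, trunc=0.1):
    """Standard TSDF fusion: per-cell weighted running average of
    truncated signed distances, as in KinectFusion."""
    d_new = np.clip(d_new, -trunc, trunc)   # truncate the signed distance
    D = (W * D + w_new * d_new) / (W + w_new)
    W = W + w_new
    return D, W

D = np.zeros((4, 4))   # one plane's 2D signed-distance map
W = np.zeros((4, 4))   # accumulated weights
for frame in range(3):
    d_obs = np.random.uniform(-0.2, 0.2, size=(4, 4))  # distances from depth
    D, W = update_tsdf(D, W, d_obs)
print(D.round(3))
```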
- Book Chapter
106
- 10.1007/978-3-031-19824-3_9
- Jan 1, 2022
Reconstructing 3D indoor scenes from 2D images is an important task in many computer vision and graphics applications. A main challenge in this task is that large texture-less areas in typical indoor scenes make existing methods struggle to produce satisfactory reconstruction results. We propose a new method, named NeuRIS, for high-quality reconstruction of indoor scenes. The key idea of NeuRIS is to integrate estimated normal of indoor scenes as a prior in a neural rendering framework for reconstructing large texture-less shapes and, importantly, to do this in an adaptive manner to also enable the reconstruction of irregular shapes with fine details. Specifically, we evaluate the faithfulness of the normal priors on-the-fly by checking the multi-view consistency of reconstruction during the optimization process. Only the normal priors accepted as faithful will be utilized for 3D reconstruction, which typically happens in the regions of smooth shapes possibly with weak texture. However, for those regions with small objects or thin structures, for which the normal priors are usually unreliable, we will only rely on visual features of the input images, since such regions typically contain relatively rich visual features (e.g., shade changes and boundary contours). Extensive experiments show that NeuRIS significantly outperforms the state-of-the-art methods in terms of reconstruction quality. Our project page: https://jiepengwang.github.io/NeuRIS/.
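The adaptive filtering can be sketched as masking the normal-prior loss with a per-pixel multi-view consistency test: only pixels whose reconstruction is already multi-view consistent keep their normal supervision. The error metric, threshold, and loss form below are assumptions rather than NeuRIS's exact formulation:

```python
import numpy as np

def adaptive_normal_loss(n_pred, n_prior, consistency_err, thresh=0.15):
    """Normal priors supervise only pixels whose multi-view consistency
    error is low; unreliable priors (thin structures, small objects)
    are dropped, and those pixels fall back to photometric cues."""
    mask = consistency_err < thresh
    if not mask.any():
        return 0.0
    cos = np.sum(n_pred[mask] * n_prior[mask], axis=-1)
    return float(np.mean(1.0 - cos))   # angular alignment on trusted pixels

n_pred = np.tile([0.0, 0.0, 1.0], (6, 1))           # rendered normals
n_prior = np.tile([0.0, 0.1, 0.995], (6, 1))        # predicted priors
err = np.array([0.05, 0.4, 0.1, 0.2, 0.02, 0.3])    # e.g., patch NCC errors
print(adaptive_normal_loss(n_pred, n_prior, err))
```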
- Research Article
10
- 10.1109/tvcg.2020.3036868
- Nov 10, 2020
- IEEE transactions on visualization and computer graphics
We present a new framework for online dense 3D reconstruction of indoor scenes using only depth sequences. This research is particularly useful in cases with poor lighting conditions or in nearly featureless indoor environments. The lack of RGB information makes long-range camera pose estimation difficult in a large indoor environment. The key idea of our research is to take advantage of the geometric prior of Manhattan scenes in each stage of the reconstruction pipeline with the specific aim of reducing the cumulative registration error and overall odometry drift in a long sequence. This idea is further boosted by local Manhattan frame growing and a local-to-global strategy that leads to implicit loop-closure handling for a large indoor scene. Our proposed pipeline, namely ManhattanFusion, starts with planar alignment and local pose optimization, where the Manhattan constraints are imposed to create detailed local segments. These segments preserve intrinsic scene geometry by minimizing the odometry drift even under complex and long trajectories. The final model is generated by integrating all local segments into a global volumetric representation under the constraint of Manhattan frame-based registration across segments. Our algorithm outperforms others that use depth data only in terms of both the mean distance error and the absolute trajectory error, and it is also very competitive compared with RGB-D based reconstruction algorithms. Moreover, our algorithm outperforms the state-of-the-art in terms of surface area coverage by 10-40 percent, largely due to the usefulness and effectiveness of the Manhattan assumption throughout the reconstruction pipeline.
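One way to picture the Manhattan constraint is snapping an estimated local frame to the nearest rotation whose axes align with the global Manhattan axes. The Procrustes-style projection below is a simplified stand-in for the paper's full registration pipeline:

```python
import numpy as np

def snap_to_manhattan(R):
    """Snap a rotation R to the nearest axis-aligned rotation: map each
    column to its dominant signed coordinate axis, then project back to
    a proper rotation via SVD (simplified Manhattan-frame constraint)."""
    T = np.zeros_like(R)
    for j in range(3):
        i = np.argmax(np.abs(R[:, j]))
        T[i, j] = np.sign(R[i, j])
    U, _, Vt = np.linalg.svd(T)
    S = np.diag([1.0, 1.0, np.linalg.det(U @ Vt)])   # enforce det = +1
    return U @ S @ Vt

theta = np.deg2rad(3.0)   # small drift around the vertical axis
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])
print(snap_to_manhattan(R))   # -> identity: drift removed
```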
- Research Article
3
- 10.1109/tvcg.2024.3444036
- Sep 1, 2025
- IEEE transactions on visualization and computer graphics
The reconstruction of indoor scenes from multi-view RGB images is challenging due to the coexistence of flat and texture-less regions alongside delicate and fine-grained regions. Recent methods leverage neural radiance fields aided by predicted surface normal priors to recover the scene geometry. These methods excel in producing complete and smooth results for floor and wall areas. However, they struggle to capture complex surfaces with high-frequency structures due to the inadequate neural representation and the inaccurately predicted normal priors. This work aims to reconstruct high-fidelity surfaces with fine-grained details by addressing the above limitations. To improve the capacity of the implicit representation, we propose a hybrid architecture to represent low-frequency and high-frequency regions separately. To enhance the normal priors, we introduce a simple yet effective image sharpening and denoising technique, coupled with a network that estimates the pixel-wise uncertainty of the predicted surface normal vectors. Identifying such uncertainty can prevent our model from being misled by unreliable surface normal supervisions that hinder the accurate reconstruction of intricate geometries. Experiments on the benchmark datasets show that our method outperforms existing methods in terms of reconstruction quality. Furthermore, the proposed method also generalizes well to real-world indoor scenarios captured by our hand-held mobile phones.
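A common heteroscedastic formulation captures the idea of down-weighting unreliable normal supervision by predicted uncertainty; the sketch below uses that generic form, which is an assumption rather than necessarily the paper's exact loss:

```python
import numpy as np

def uncertainty_weighted_normal_loss(n_pred, n_prior, log_var):
    """Uncertainty-aware normal supervision: pixels with high predicted
    uncertainty contribute less, and the log-variance term penalizes
    the trivial all-uncertain solution (standard heteroscedastic form)."""
    err = 1.0 - np.sum(n_pred * n_prior, axis=-1)   # angular error per pixel
    return float(np.mean(np.exp(-log_var) * err + log_var))

n_pred = np.tile([0.0, 0.0, 1.0], (4, 1))
n_prior = np.array([[0.0, 0.0, 1.0],    # good prior
                    [0.0, 0.2, 0.98],   # slightly off
                    [1.0, 0.0, 0.0],    # bad prior (intricate geometry)
                    [0.0, 0.0, 1.0]])
log_var = np.array([0.0, 0.0, 2.0, -1.0])   # high uncertainty on the bad one
print(uncertainty_weighted_normal_loss(n_pred, n_prior, log_var))
```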
- Research Article
- 10.1007/s11222-024-10508-3
- Nov 15, 2024
- Statistics and Computing
This work combines multilevel Monte Carlo with importance sampling to estimate rare-event quantities that can be expressed as the expectation of a Lipschitz observable of the solution to a broad class of McKean–Vlasov stochastic differential equations. We extend the double loop Monte Carlo (DLMC) estimator introduced in this context in Ben Rached et al. (Stat Comput, 2024. https://doi.org/10.1007/s11222-024-10497-3) to the multilevel setting. We formulate a novel multilevel DLMC estimator and perform a comprehensive cost-error analysis yielding new and improved complexity results. Crucially, we devise an antithetic sampler to estimate level differences guaranteeing reduced computational complexity for the multilevel DLMC estimator compared with the single-level DLMC estimator. To address rare events, we apply the importance sampling scheme, obtained via stochastic optimal control in Ben Rached et al. (2024), over all levels of the multilevel DLMC estimator. Combining importance sampling and multilevel DLMC reduces computational complexity by one order and drastically reduces the associated constant compared to the single-level DLMC estimator without importance sampling. We illustrate the effectiveness of the proposed multilevel DLMC estimator on the Kuramoto model from statistical physics with Lipschitz observables, confirming the reduced complexity from $O(\mathrm{TOL}_r^{-4})$ for the single-level DLMC estimator to $O(\mathrm{TOL}_r^{-3})$ while providing a feasible estimate of rare-event quantities up to prescribed relative error tolerance $\mathrm{TOL}_r$.
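The basic multilevel structure, before adding importance sampling, antithetic samplers, or McKean–Vlasov dynamics, can be sketched for a plain scalar SDE with coupled Euler–Maruyama paths; every name and parameter below is illustrative:

```python
import numpy as np

def mlmc_estimate(g, drift, sigma, x0, T, L, n_samples, rng=None):
    """Multilevel Monte Carlo for E[g(X_T)] of dX = drift(X)dt + sigma dW:
    level l uses 2**l Euler steps; the level-l correction couples fine
    and coarse paths through the same Brownian increments, so the
    telescoping sum E[g(X_0)] + sum_l E[g(X_l) - g(X_{l-1})] holds."""
    rng = np.random.default_rng() if rng is None else rng
    est = 0.0
    for l in range(L + 1):
        n, nf = n_samples[l], 2 ** l
        dt = T / nf
        dW = rng.normal(0.0, np.sqrt(dt), size=(n, nf))
        xf = np.full(n, x0)
        for k in range(nf):                 # fine Euler-Maruyama path
            xf = xf + drift(xf) * dt + sigma * dW[:, k]
        if l == 0:
            est += np.mean(g(xf))
        else:
            xc = np.full(n, x0)
            for k in range(nf // 2):        # coarse path, summed increments
                dWc = dW[:, 2 * k] + dW[:, 2 * k + 1]
                xc = xc + drift(xc) * (2 * dt) + sigma * dWc
            est += np.mean(g(xf) - g(xc))
    return est

print(mlmc_estimate(g=lambda x: np.maximum(x - 1.0, 0.0),
                    drift=lambda x: 0.05 * x, sigma=0.2, x0=1.0,
                    T=1.0, L=4, n_samples=[4000, 2000, 1000, 500, 250],
                    rng=np.random.default_rng(1)))
```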
- Research Article
17
- 10.1109/jsen.2020.3024702
- Sep 23, 2020
- IEEE Sensors Journal
Completeness and accuracy are two important factors in image-based indoor scene 3D reconstruction. Thus, an efficient image-capturing scheme that completely covers the scene and a robust reconstruction method that accurately reconstructs it are both required. To this end, in this article we propose a new pipeline for indoor scene capturing and reconstruction using a mini drone and a ground robot, which takes both capturing completeness and reconstruction accuracy into consideration. First, we use a mini drone to capture aerial video of the indoor scene, from which a 3D aerial map is reconstructed. Then, the robot's moving path is planned and a set of ground-view reference images is synthesized from the aerial map. After that, the robot enters the scene and captures ground video autonomously, using the reference images to localize itself during the movement. Finally, the ground and aerial images, adaptively extracted from the captured videos, are merged to reconstruct a complete and accurate indoor scene model. Experimental results on two indoor scenes demonstrate the effectiveness and robustness of our proposed capturing and reconstruction pipeline.