Image-Based Rendering for Large-Scale Outdoor Scenes With Fusion of Monocular and Multi-View Stereo Depth
Image-based rendering (IBR) attempts to synthesize novel views using a set of observed images. Some IBR approaches (such as light fields) have yielded impressive high-quality results on small-scale scenes with dense photo capture. However, available wide-baseline IBR methods are still restricted by the low geometric accuracy and completeness of multi-view stereo (MVS) reconstruction on low-textured and non-Lambertian surfaces. The issues become more significant in large-scale outdoor scenes due to challenging scene content, e.g., buildings, trees, and sky. To address these problems, we present a novel IBR algorithm that consists of two key components. First, we propose a novel depth refinement method that combines MVS depth maps with monocular depth maps predicted via deep learning. A lookup table remap is proposed for converting the scale of the monocular depths to be consistent with the scale of the MVS depths. Then, the rescaled monocular depth is used as the constraint in the minimum spanning tree (MST)-based nonlocal filter to refine the per-view MVS depth. Second, we present an efficient shape-preserving warping algorithm that uses superpixels to generate the warped images and blend expected novel views of scenes. The proposed method has been evaluated on public MVS and view synthesis datasets, as well as newly captured large-scale outdoor datasets. In comparison with state-of-the-art methods, the experimental results demonstrated that the proposed method can obtain more complete and reliable depth maps for the challenging large-scale outdoor scenes, thereby resulting in more promising novel view synthesis.
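The lookup-table remap mentioned above can be illustrated with a minimal sketch. The abstract does not say how the table is built, so everything here is an assumption made for illustration: monocular depths are binned at pixels where MVS depth is valid, each bin stores the median MVS depth, and the table is linearly interpolated. The names `build_depth_lut` and `remap_depth` and the bin count are hypothetical.

```python
import numpy as np

def build_depth_lut(mono, mvs, valid, n_bins=32):
    """Build a lookup table from monocular depth to MVS-scale depth.

    Assumed construction (not taken from the paper): per-bin median of
    MVS depth over pixels where the MVS estimate is valid.
    """
    m = mono[valid]
    d = mvs[valid]
    edges = np.linspace(m.min(), m.max(), n_bins + 1)
    # Bin index per sample; clip so the maximum lands in the last bin.
    idx = np.clip(np.digitize(m, edges) - 1, 0, n_bins - 1)
    centers, values = [], []
    for b in range(n_bins):
        in_bin = idx == b
        if in_bin.any():  # empty bins are skipped and bridged by interpolation
            centers.append(0.5 * (edges[b] + edges[b + 1]))
            values.append(np.median(d[in_bin]))
    return np.asarray(centers), np.asarray(values)

def remap_depth(mono, centers, values):
    """Rescale a full monocular depth map by interpolating the table."""
    flat = np.interp(mono.ravel(), centers, values)
    return flat.reshape(mono.shape)
```

Unlike a single global scale and shift, a table of this kind can follow a nonlinear but monotone relation between the two depth sources. Values outside the table range are clamped to the nearest entry by `np.interp`.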
- Research Article
578
- 10.1145/3272127.3275084
- Dec 4, 2018
- ACM Transactions on Graphics
Free-viewpoint image-based rendering (IBR) is a standing challenge. IBR methods combine warped versions of input photos to synthesize a novel view. The image quality of this combination is directly affected by geometric inaccuracies of multi-view stereo (MVS) reconstruction and by view- and image-dependent effects that produce artifacts when contributions from different input views are blended. We present a new deep learning approach to blending for IBR, in which we use held-out real image data to learn blending weights to combine input photo contributions. Our Deep Blending method requires us to address several challenges to achieve our goal of interactive free-viewpoint IBR navigation. We first need to provide sufficiently accurate geometry so the Convolutional Neural Network (CNN) can succeed in finding correct blending weights. We do this by combining two different MVS reconstructions with complementary accuracy vs. completeness tradeoffs. To tightly integrate learning in an interactive IBR system, we need to adapt our rendering algorithm to produce a fixed number of input layers that can then be blended by the CNN. We generate training data with a variety of captured scenes, using each input photo as ground truth in a held-out approach. We also design the network architecture and the training loss to provide high quality novel view synthesis, while reducing temporal flickering artifacts. Our results demonstrate free-viewpoint IBR in a wide variety of scenes, clearly surpassing previous methods in visual quality, especially when moving far from the input cameras.
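The blending step described above, where a fixed number of warped input layers is combined with per-pixel weights, can be sketched as follows. This is only an illustration: in the paper the weights come from a trained CNN, whereas here the raw scores (`logits`) are simply passed in, and normalizing them with a softmax is an assumption. The function name `blend_layers` is hypothetical.

```python
import numpy as np

def blend_layers(layers, logits):
    """Blend K warped images with per-pixel weights.

    layers: (K, H, W, 3) warped input photos.
    logits: (K, H, W) unnormalized per-pixel scores (stand-in for CNN output).
    """
    # Numerically stable softmax over the layer axis.
    shifted = logits - logits.max(axis=0, keepdims=True)
    weights = np.exp(shifted)
    weights /= weights.sum(axis=0, keepdims=True)
    # Weighted sum of the layers, broadcasting weights over color channels.
    return (weights[..., None] * layers).sum(axis=0)
```

Because the weights sum to one at every pixel, equal scores reduce to a plain average of the layers, and one dominant score selects a single layer.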
- Research Article
22
- 10.1609/aaai.v37i3.25368
- Jun 26, 2023
- Proceedings of the AAAI Conference on Artificial Intelligence
Self-supervised monocular methods can efficiently learn depth information of weakly textured surfaces or reflective objects. However, the depth accuracy is limited due to the inherent ambiguity in monocular geometric modeling. In contrast, multi-frame depth estimation methods improve depth accuracy thanks to the success of Multi-View Stereo (MVS), which directly makes use of geometric constraints. Unfortunately, MVS often suffers from texture-less regions, non-Lambertian surfaces, and moving objects, especially in real-world video sequences without known camera motion and depth supervision. Therefore, we propose MOVEDepth, which exploits the MOnocular cues and VElocity guidance to improve multi-frame Depth learning. Unlike existing methods that enforce consistency between MVS depth and monocular depth, MOVEDepth boosts multi-frame depth learning by directly addressing the inherent problems of MVS. The key of our approach is to utilize monocular depth as a geometric priority to construct MVS cost volume, and adjust depth candidates of cost volume under the guidance of predicted camera velocity. We further fuse monocular depth and MVS depth by learning uncertainty in the cost volume, which results in a robust depth estimation against ambiguity in multi-view geometry. Extensive experiments show MOVEDepth achieves state-of-the-art performance: Compared with Monodepth2 and PackNet, our method relatively improves the depth accuracy by 20% and 19.8% on the KITTI benchmark. MOVEDepth also generalizes to the more challenging DDAD benchmark, relatively outperforming ManyDepth by 7.2%. The code is available at https://github.com/JeffWang987/MOVEDepth.
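The idea of centering cost-volume depth candidates on a monocular prior can be sketched as below. The abstract does not state how the predicted camera velocity sets the search range, so the sketch takes the relative range as a plain input (`range_ratio`) rather than guessing that mapping. The function name `depth_candidates` and the linear spacing of candidates are assumptions.

```python
import numpy as np

def depth_candidates(mono_depth, range_ratio, n_candidates=16):
    """Per-pixel depth hypotheses centered on a monocular depth prior.

    mono_depth:  (H, W) monocular depth estimate.
    range_ratio: scalar or (H, W) relative half-width of the search range
                 (in MOVEDepth this would be guided by predicted velocity).
    Returns an (n_candidates, H, W) array spanning
    mono_depth * (1 - range_ratio) to mono_depth * (1 + range_ratio).
    """
    steps = np.linspace(-1.0, 1.0, n_candidates).reshape(-1, 1, 1)
    return mono_depth[None] * (1.0 + range_ratio * steps)
```

With an odd number of candidates, the middle hypothesis equals the monocular prior itself.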
- Research Article
49
- 10.3390/rs16050773
- Feb 22, 2024
- Remote Sensing
Three-dimensional reconstruction is a key technology employed to represent virtual reality in the real world, which is valuable in computer vision. Large-scale 3D models have broad application prospects in the fields of smart cities, navigation, virtual tourism, disaster warning, and search-and-rescue missions. Unfortunately, most image-based studies currently prioritize the speed and accuracy of 3D reconstruction in indoor scenes. While there are some studies that address large-scale scenes, there has been a lack of systematic comprehensive efforts to bring together the advancements made in the field of 3D reconstruction in large-scale scenes. Hence, this paper presents a comprehensive overview of a 3D reconstruction technique that utilizes multi-view imagery from large-scale scenes. In this article, a comprehensive summary and analysis of vision-based 3D reconstruction technology for large-scale scenes are presented. The 3D reconstruction algorithms are extensively categorized into traditional and learning-based methods. Furthermore, these methods can be categorized based on whether the sensor actively illuminates objects with light sources, resulting in two categories: active and passive methods. Two active methods, namely, structured light and laser scanning, are briefly introduced. The focus then shifts to structure from motion (SfM), stereo matching, and multi-view stereo (MVS), encompassing both traditional and learning-based approaches. Additionally, a novel approach of neural-radiance-field-based 3D reconstruction is introduced. The workflow and improvements in large-scale scenes are elaborated upon. Subsequently, some well-known datasets and evaluation metrics for various 3D reconstruction tasks are introduced. Lastly, a summary of the challenges encountered in the application of 3D reconstruction technology in large-scale outdoor scenes is provided, along with predictions for future trends in development.
- Research Article
125
- 10.1109/tcsvt.2003.817350
- Nov 1, 2003
- IEEE Transactions on Circuits and Systems for Video Technology
Image-based rendering (IBR) has become a very active research area in recent years. The spectral analysis problem for IBR has not been completely solved. In this paper, we present a new method to parameterize the problem, which is applicable for general-purpose IBR spectral analysis. We notice that any plenoptic function is generated by light rays emitted/reflected/refracted from the object surface. We introduce the surface plenoptic function (SPF), which represents the light rays starting from the object surface. Given that radiance along a light ray does not change unless the light ray is blocked, SPF reduces the dimension of the original plenoptic function to 6D. We are then able to map or transform the SPF to IBR representations captured along any camera trajectory. Assuming some properties on the SPF, we can analyze the properties of IBR for generic scenes such as scenes with Lambertian or non-Lambertian surfaces and scenes with or without occlusions, and for different sampling strategies such as lightfield/concentric mosaic. We find that in most cases, even though the SPF may be band-limited, the frequency spectrum of IBR is not band-limited. We show that non-Lambertian reflections, depth variations and occlusions can all broaden the spectrum, with the latter two being more significant. SPF is defined for scenes with known geometry. When the geometry is unknown, spectral analysis is still possible. We show that with the truncating windows analysis and some conclusions obtained with SPF, the spectrum expansion caused by non-Lambertian reflections and occlusions can be quantitatively estimated, even when the scene geometry is not explicitly known. Given the spectrum of IBR, we also study how to sample IBR data more efficiently. Our analysis is based on the generalized periodic sampling theory with arbitrary geometry. We show that the sampling efficiency can be up to twice that of rectangular sampling. The advantages and disadvantages of generalized periodic sampling for IBR are also discussed.
- Research Article
25
- 10.1109/tcsvt.2009.2026948
- Nov 1, 2009
- IEEE Transactions on Circuits and Systems for Video Technology
In this paper, we propose a real-time image-based rendering (IBR) system. It is specifically designed for photorealistic view synthesis at high speed on the graphics processing unit (GPU). We steer the proposed IBR system design with two high-level ideas. First, for cost-effective IBR, as long as the synthesized views look visually plausible, the estimated disparity and occlusion need not be correct. Hence, we jointly optimize stereo matching and view synthesis for a favorable end-to-end performance. Second, for great real-time acceleration on GPUs, all functional modules need to be shaped at an early design stage, fitting the massively parallel streaming architecture of GPUs. Based on these two guidelines, we first propose a stream-centric local stereo matching algorithm. The key idea is to construct a versatile set of variable support patterns in a highly efficient manner, and then an optimal local support pattern is selected to approximate varying image structures adaptively. Next, a low-complexity adaptive view synthesis technique is proposed. It efficiently tackles visual artifacts in synthesized images, using a novel photometric outlier detection and handling scheme. We evaluated both the disparity estimation accuracy and novel view synthesis quality of the proposed approach, based on the benchmark Middlebury stereo datasets. The experiments show that our local stereo method produces consistently reliable disparity estimates for both homogeneous regions and depth discontinuities, outperforming several previous GPU-based local methods. More importantly, visually plausible intermediate views are generated by our IBR approach at high speed on the GPU. With stereo matching and view synthesis completely running on an NVIDIA GeForce 8800 GT graphics card, the proposed IBR system reaches about 100 f/s for 450×375 stereo images with 60 disparity levels.
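To put the reported throughput in perspective, the figures in the last sentence (about 100 frames per second, 450 by 375 pixel images, 60 disparity levels) can be turned into a count of disparity hypotheses evaluated per second. This is a back-of-the-envelope sketch, assuming every pixel is tested at every disparity level; the function name is hypothetical.

```python
def disparity_evaluations_per_second(width, height, levels, fps):
    """Pixel-disparity hypotheses tested per second, assuming an
    exhaustive search over all disparity levels at every pixel."""
    return width * height * levels * fps

# Figures quoted in the abstract: 450 x 375 images, 60 levels, ~100 f/s.
rate = disparity_evaluations_per_second(450, 375, 60, 100)
print(f"{rate / 1e6:.1f} million disparity evaluations per second")
```

Under that assumption the system evaluates roughly one billion hypotheses per second.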
- Research Article
405
- 10.1145/2487228.2487238
- Jun 1, 2013
- ACM Transactions on Graphics
Modern camera calibration and multiview stereo techniques enable users to smoothly navigate between different views of a scene captured using standard cameras. The underlying automatic 3D reconstruction methods work well for buildings and regular structures but often fail on vegetation, vehicles, and other complex geometry present in everyday urban scenes. Consequently, missing depth information makes Image-Based Rendering (IBR) for such scenes very challenging. Our goal is to provide plausible free-viewpoint navigation for such datasets. To do this, we introduce a new IBR algorithm that is robust to missing or unreliable geometry, providing plausible novel views even in regions quite far from the input camera positions. We first oversegment the input images, creating superpixels of homogeneous color content which often tends to preserve depth discontinuities. We then introduce a depth synthesis approach for poorly reconstructed regions based on a graph structure on the oversegmentation and appropriate traversal of the graph. The superpixels augmented with synthesized depth allow us to define a local shape-preserving warp which compensates for inaccurate depth. Our rendering algorithm blends the warped images, and generates plausible image-based novel views for our challenging target scenes. Our results demonstrate novel view synthesis in real time for multiple challenging scenes with significant depth complexity, providing a convincing immersive navigation experience.
- Dissertation
- 10.14711/thesis-991012786067603412
- Jan 1, 2019
Multi-view stereo (MVS) reconstructs 3D representations of the scene from imagery, which is a core problem of computer vision extensively studied for decades. Traditionally, MVS algorithms apply hand-crafted similarity metrics and engineered regularizations to compute dense correspondences. While these methods have shown great results under ideal Lambertian scenarios, classical MVS algorithms still suffer from numerous artifacts. In this thesis, we propose to advance the MVS reconstruction using recent deep learning techniques. First, we present an end-to-end deep learning architecture, MVSNet, for depth map inference from multi-view images. The key contribution of this part is the careful integration between multi-view geometries and convolutional neural networks (CNNs). In the network, we extract deep image features and build the 3D cost volume upon the camera frustum via the differentiable homography warping. Then, 3D convolutions are applied to regularize and regress the output depth map. We demonstrate on the DTU dataset that MVSNet significantly outperforms previous state-of-the-art methods in both reconstruction completeness and overall quality. Next, we propose to extend the MVSNet architecture for large-scale MVS reconstruction. One major limitation of current learning-based approaches is the scalability: the memory-consuming cost volume regularization makes the learned MVS hard to apply to high-resolution scenes. To this end, we sequentially regularize 2D cost maps via the gated recurrent unit (GRU) rather than regularize the entire 3D cost volume in one go. The GRU regularization dramatically reduces memory consumption and makes high-resolution reconstructions feasible. The proposed R-MVSNet is evaluated on the large-scale Tanks and Temples dataset and achieves comparable results to classical large-scale MVS algorithms. Finally, we establish a large-scale synthetic MVS dataset, BlendedMVS, based on blended images and rendered depth maps. While several MVS datasets have been proposed, they fail to provide accurate depth and occlusion information as ground truth mesh models are usually incomplete. We therefore establish a new MVS dataset based on model rendering. Textured meshes are first reconstructed from images of different scenes, which are then rendered into color images, depth maps and occlusion maps. We further blend rendered images with input images using high-pass and low-pass filters to generate our training input. Extensive experiments demonstrate that models trained on BlendedMVS achieve significantly better generalization ability compared with models trained on other MVS datasets. In sum, this thesis presents a complete learning-based solution to large-scale multi-view stereopsis, including a current baseline network (MVSNet), its large-scale extension (R-MVSNet) and a large-scale synthetic dataset (BlendedMVS). We bridge the gap between classical MVS reconstructions and recent deep learning techniques and demonstrate the effectiveness of the learning-based MVS through extensive experiments on different datasets.
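Regressing a depth map from a regularized cost volume is often done with a soft argmin, that is, the expected depth under a softmax over the depth hypotheses. The sketch below illustrates that operation with NumPy. It assumes that lower cost means a better match and uses a hypothetical function name, and it is an illustration of the general technique rather than the thesis code.

```python
import numpy as np

def regress_depth(cost_volume, depth_values):
    """Soft-argmin depth regression.

    cost_volume:  (D, H, W) matching cost per depth hypothesis
                  (assumed: lower cost = better match).
    depth_values: (D,) depth of each hypothesis plane.
    Returns an (H, W) depth map: the per-pixel expected depth.
    """
    logits = -cost_volume
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    prob = np.exp(logits)
    prob /= prob.sum(axis=0, keepdims=True)               # probability volume
    # Expectation over the depth axis: sum_d depth_values[d] * prob[d].
    return np.tensordot(depth_values, prob, axes=(0, 0))
```

Unlike a hard argmin, the expectation is differentiable and yields sub-plane depth estimates, which is what makes it usable inside an end-to-end trained network.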
- Conference Article
- 10.1109/robio.2005.246310
- Jan 1, 2005
In this paper, we address the problem of image-based rendering (IBR) on images taken by a mobile robot, called a robot image database. IBR is a standard approach to view synthesis that would be very important for human-robot interface systems such as a teleoperation system, since synthesized views are helpful for a human user to understand the robot's operating environment. The main difficulty of our problem is that the viewpoint locations of input (real) images are not precisely known, due to estimation errors inherent in the positioning systems. To solve this problem, we propose a novel IBR method, where the location uncertainty is reduced using a visual landmark, which is commonly used in mobile robotics. Also, novel priors on real images are introduced to regularize the IBR problem. As a result, IBR can be performed successfully under the location uncertainty.
- Research Article
113
- 10.1145/3306346.3323013
- Jul 12, 2019
- ACM Transactions on Graphics
We propose the first learning-based algorithm that can relight images in a plausible and controllable manner given multiple views of an outdoor scene. In particular, we introduce a geometry-aware neural network that utilizes multiple geometry cues (normal maps, specular direction, etc.) and source and target shadow masks computed from a noisy proxy geometry obtained by multi-view stereo. Our model is a three-stage pipeline: two subnetworks refine the source and target shadow masks, and a third performs the final relighting. Furthermore, we introduce a novel representation for the shadow masks, which we call RGB shadow images. They reproject the colors from all views into the shadowed pixels and enable our network to cope with inaccuracies in the proxy and the non-locality of the shadow casting interactions. Acquiring large-scale multi-view relighting datasets for real scenes is challenging, so we train our network on photorealistic synthetic data. At train time, we also compute a noisy stereo-based geometric proxy, this time from the synthetic renderings. This allows us to bridge the gap between the real and synthetic domains. Our model generalizes well to real scenes. It can alter the illumination of drone footage, image-based renderings, textured mesh reconstructions, and even internet photo collections.
- Research Article
140
- 10.1109/msp.2007.905702
- Nov 1, 2007
- IEEE Signal Processing Magazine
One of the most important applications in multiview imaging (MVI) is the development of advanced immersive viewing or visualization systems using, for instance, 3DTV. With the introduction of multiview TVs, it is expected that a new age of 3DTV systems will arrive in the near future. Image-based rendering (IBR) refers to a collection of techniques and representations that allow 3-D scenes and objects to be visualized in a realistic way without full 3-D model reconstruction. IBR uses images as the primary substrate. The potential for photorealistic visualization has tremendous appeal, and it has been receiving increasing attention over the years. Applications such as video games, virtual travel, and E-commerce stand to benefit from this technology. This article serves as a tutorial introduction and brief review of this important technology. First the classification, principles, and key research issues of IBR are discussed. Then, an object-based IBR system to illustrate the techniques involved and its potential application in view synthesis and processing are explained. Stereo matching, which is an important technique for depth estimation and view synthesis, is briefly explained and some of the top-ranked methods are highlighted. Finally, the challenging problem of interactive IBR is explained. Possible solutions and some state-of-the-art systems are also reviewed.
- Conference Article
7
- 10.1109/oceans.2010.5664318
- Sep 1, 2010
Over the last several decades, developments in Underwater Laser Line Scan (LLS) systems have resulted in significant improvements in turbid water imaging performance. In addition to allowing for high quality image acquisition through tens of attenuation lengths, the recently renewed interest in multiple platform distributed LLS configurations also has the potential for synoptic coverage of much larger regions of seabed. A related issue worth investigation is how to utilize these capabilities to improve rendering of the underwater scenes. In this regard, Light Field Rendering (LFR), a type of Image Based Rendering (IBR) technique, offers several advantages. LFR enables multi-perspective target visualization without measuring the geometrical dimension of the target. Compared to other IBR techniques, LFR can provide Signal-to-Noise Ratio (SNR) improvements and the ability to image through obscuring objects in front of the target. On the other hand, multi-static LLS can be readily configured to acquire images to generate LFR. This paper investigates the application of LFR to images taken from a distributed bi-static LLS imager to create multi-perspective rendering of an unknown underwater scene. The issues related to effectively applying this technique to underwater LLS imagery are analyzed and an image post-processing flow to address these issues is proposed. An experiment was conducted in the FAU-HBOI optical imaging test tank, the results from which demonstrated the capability of using a bi-static/multi-static LLS system to generate LFR and also verified the proposed image processing flow. The aforementioned benefits of LFR were also presented.
- Research Article
104
- 10.3390/rs11010063
- Dec 31, 2018
- Remote Sensing
High-throughput phenotyping technologies have become an increasingly important topic of crop science in recent years. Various sensors and data acquisition approaches have been applied to acquire the phenotyping traits. It is quite confusing for crop phenotyping researchers to determine an appropriate way for their application. In this study, three representative three-dimensional (3D) data acquisition approaches, including 3D laser scanning, multi-view stereo (MVS) reconstruction, and 3D digitizing, were evaluated for maize plant phenotyping in multi growth stages. Phenotyping traits accuracy, post-processing difficulty, device cost, data acquisition efficiency, and automation were considered during the evaluation process. 3D scanning provided satisfactory point clouds for medium and high maize plants with acceptable efficiency, while the results were not satisfactory for small maize plants. The equipment used in 3D scanning is expensive, but is highly automatic. MVS reconstruction provided satisfactory point clouds for small and medium plants, and point deviations were observed in upper parts of higher plants. MVS data acquisition, using low-cost cameras, exhibited the highest efficiency among the three evaluated approaches. The one-by-one pipeline data acquisition pattern allows the use of MVS high-throughput in further phenotyping platforms. Undoubtedly, enhancement of point cloud processing technologies is required to improve the extracted phenotyping traits accuracy for both 3D scanning and MVS reconstruction. Finally, 3D digitizing was time-consuming and labor intensive. However, it does not depend on any post-processing algorithms to extract phenotyping parameters and reliable phenotyping traits could be derived. The promising accuracy of 3D digitizing is a better verification choice for other 3D phenotyping approaches. 
Our study provides a clear reference for phenotyping data acquisition of maize plants, especially for the affordable and portable field phenotyping platforms to be developed.
- Book Chapter
- 10.1007/978-981-19-5096-4_9
- Jan 1, 2022
To address the problem of incomplete Multi-view Stereo (MVS) reconstruction, the initial depth and loss function of the depth residual iterative network are investigated, and a new multi-view stereo reconstruction network integrating depth normal consistency and depth map thinning is presented. First, the input image is downsampled to create an image pyramid, and a feature map is extracted from the image pyramid. Then, a cost volume is constructed from the 2D feature map, and the depth normal consistency is added to the initial cost volume to optimize the depth map. The network is tested on the DTU data set and compared with traditional reconstruction approaches and MVS networks based on deep learning. The experimental results show that the proposed MVS reconstruction network produces better results in completeness and increases the quality of MVS reconstruction.
Keywords: Normal-depth consistency, Feature loss, Cost volume, Depth map refinement, MVS
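The first step above, downsampling the input image into an image pyramid, can be sketched as follows. The abstract does not name the downsampling filter or the number of levels, so the 2x2 average pooling and the default of three levels are assumptions, and `build_pyramid` is a hypothetical name.

```python
import numpy as np

def build_pyramid(image, levels=3):
    """Image pyramid by repeated 2x2 average pooling (assumed filter).

    image: (H, W) or (H, W, C) array. Each level halves height and width;
    an odd trailing row or column is dropped before pooling.
    """
    pyramid = [np.asarray(image, dtype=np.float64)]
    for _ in range(levels - 1):
        prev = pyramid[-1]
        h = prev.shape[0] // 2 * 2
        w = prev.shape[1] // 2 * 2
        prev = prev[:h, :w]
        # Average each non-overlapping 2x2 block.
        down = 0.25 * (prev[0::2, 0::2] + prev[1::2, 0::2]
                       + prev[0::2, 1::2] + prev[1::2, 1::2])
        pyramid.append(down)
    return pyramid
```

In a coarse-to-fine network, features would be extracted at every level, and the depth estimated at a coarse level would initialize the next finer one.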
- Conference Article
7
- 10.1145/3384382.3384523
- May 5, 2020
Multi-view stereo can be used to rapidly create realistic virtual content, such as textured meshes or a geometric proxy for free-viewpoint Image-Based Rendering (IBR). These solutions greatly simplify the content creation process compared to traditional methods, but it is difficult to modify the content of the scene. We propose a novel approach to create scenes by composing (parts of) multiple captured scenes. The main difficulty of such compositions is that lighting conditions in each captured scene are different; to obtain a realistic composition we need to make lighting coherent. We propose a two-pass solution, by adapting a multi-view relighting network. We first match the lighting conditions of each scene separately and then synthesize shadows between scenes in a subsequent pass. We also improve the realism of the composition by estimating the change in ambient occlusion in contact areas between parts and compensate for the color balance of the different cameras used for capture. We illustrate our method with results on multiple compositions of outdoor scenes and show its application to multi-view image composition, IBR and textured mesh creation.
- Conference Article
- 10.1109/iros.2005.1545558
- Jan 1, 2005
Image-based rendering (IBR) is one of the standard approaches to view synthesis. IBR would be very important for a human-robot interface (HRI) system such as a teleoperation system, since the synthesized views are helpful for a human user to understand the robot's operating environment. In this paper, we address the problem of IBR on images taken by a mobile robot. The main difficulty of our problem is that the viewpoint locations of input (real) images are not precisely known, due to estimation errors inherent in the positioning systems. To solve this problem, we propose a novel IBR method, where the location uncertainty is reduced using a visual landmark, which is commonly used in mobile robotics. In addition, novel priors on real images are introduced to regularize the IBR problem. As a result, IBR can be performed successfully under the location uncertainty.