SymmNeRF: Learning to Explore Symmetry Prior for Single-View View Synthesis
Abstract We study the problem of novel view synthesis of objects from a single image. Existing methods have demonstrated the potential in single-view view synthesis. However, they still fail to recover the fine appearance details, especially in self-occluded areas. This is because a single view only provides limited information. We observe that man-made objects usually exhibit symmetric appearances, which introduce additional prior knowledge. Motivated by this, we investigate the potential performance gains of explicitly embedding symmetry into the scene representation. In this paper, we propose SymmNeRF, a neural radiance field (NeRF) based framework that combines local and global conditioning under the introduction of symmetry priors. In particular, SymmNeRF takes the pixel-aligned image features and the corresponding symmetric features as extra inputs to the NeRF, whose parameters are generated by a hypernetwork. As the parameters are conditioned on the image-encoded latent codes, SymmNeRF is thus scene-independent and can generalize to new scenes. Experiments on synthetic and real-world datasets show that SymmNeRF synthesizes novel views with more details regardless of the pose transformation, and demonstrates good generalization when applied to unseen objects. Code is available at: https://github.com/xingyi-li/SymmNeRF.
Keywords: Novel view synthesis, NeRF, Symmetry, HyperNetwork
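The idea of looking up "corresponding symmetric features" can be illustrated with a small geometric sketch. This is not the authors' implementation: it assumes the symmetry plane is known and expressed in the same camera frame as the query point, and it uses a simple pinhole camera. The function names `reflect_across_plane` and `project_pinhole`, and all numeric values, are hypothetical.

```python
import math

def reflect_across_plane(point, normal, offset=0.0):
    """Mirror a 3D point across the plane {x : n . x = offset} (n is normalized here)."""
    length = math.sqrt(sum(c * c for c in normal))
    if length == 0.0:
        raise ValueError("plane normal must be non-zero")
    unit = [c / length for c in normal]
    # Signed distance from the point to the plane along the unit normal.
    signed_dist = sum(p * c for p, c in zip(point, unit)) - offset
    return tuple(p - 2.0 * signed_dist * c for p, c in zip(point, unit))

def project_pinhole(point, focal, cx, cy):
    """Project a camera-space point (z > 0) to pixel coordinates."""
    x, y, z = point
    if z <= 0.0:
        raise ValueError("point must lie in front of the camera")
    return (focal * x / z + cx, focal * y / z + cy)

# Assumed example: symmetry plane x = 0, 128x128 image, focal length 100 px.
point = (0.5, 0.2, 2.0)
mirrored = reflect_across_plane(point, normal=(1.0, 0.0, 0.0))
pixel = project_pinhole(point, focal=100.0, cx=64.0, cy=64.0)
mirrored_pixel = project_pinhole(mirrored, focal=100.0, cx=64.0, cy=64.0)
```

In a pixel-aligned setup, image features would be sampled at both `pixel` and `mirrored_pixel`, so that a point hidden in the input view can borrow appearance cues from its visible mirror image.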
- Conference Article
2
- 10.1109/smc53654.2022.9945244
- Oct 9, 2022
Single image novel view synthesis allows the generation of target images with different views from a single input image. Pixel generation methods are one of the main approaches for novel view synthesis, with previous methods typically using the input images to infer the target image in the new view. However, only features from input images in the source view might not be sufficient to generate a good target image, especially when only a single input image is available. In this paper, we present a deep learning-based novel view synthesis approach that fuses features from an input and a warped image to collaboratively generate pixels in the new view. The warped image here is an intermediate output generated by projecting pixels of the input image onto the target view via an estimated depth. Since the estimated depth and the generated warped image are not perfect, errors will be introduced when generating target pixels. To alleviate these and to ensure better channel information between the features from input and warped image, channel attention blocks are employed. Experimental results on standard benchmark datasets show that our method produces excellent view synthesis results and outperforms other state-of-the-art methods.
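The warping step described above (projecting input pixels onto the target view via an estimated depth) can be sketched for a single pixel. This is a minimal illustration under assumed conditions, not the paper's code: a shared pinhole intrinsic for both views, and a known rigid transform (`rotation`, `translation`) from the source camera frame to the target camera frame. The name `warp_pixel` is hypothetical.

```python
def warp_pixel(u, v, depth, focal, cx, cy, rotation, translation):
    """Map a source pixel with known depth to its location in the target view.

    rotation is a 3x3 nested list and translation a length-3 sequence taking
    source-camera coordinates to target-camera coordinates (assumed known).
    Returns None when the point falls behind the target camera.
    """
    # Back-project the pixel to a 3D point in the source camera frame.
    p = ((u - cx) * depth / focal, (v - cy) * depth / focal, depth)
    # Rigidly transform the point into the target camera frame.
    q = [sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
         for i in range(3)]
    if q[2] <= 0.0:
        return None
    # Re-project into the target image.
    return (focal * q[0] / q[2] + cx, focal * q[1] / q[2] + cy)

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
same_view = warp_pixel(80.0, 50.0, 2.0, 100.0, 64.0, 64.0, identity, (0.0, 0.0, 0.0))
shifted = warp_pixel(80.0, 50.0, 2.0, 100.0, 64.0, 64.0, identity, (0.1, 0.0, 0.0))
```

Because the warped location depends directly on `depth`, any error in the estimated depth shifts the pixel in the target view, which is the source of the warping errors the abstract mentions.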
- Conference Article
6
- 10.1145/3610548.3618155
- Dec 10, 2023
Single-image novel view synthesis is a challenging and ongoing problem that aims to generate an infinite number of consistent views from a single input image. Although significant efforts have been made to advance the quality of generated novel views, less attention has been paid to the expansion of the underlying scene representation, which is crucial to the generation of realistic novel view images. This paper proposes SinMPI, a novel method that uses an expanded multiplane image (MPI) as the 3D scene representation to significantly expand the perspective range of MPI and generate high-quality novel views from a large multiplane space. The key idea of our method is to use Stable Diffusion [Rombach et al. 2021] to generate out-of-view contents, project all scene contents into an expanded multiplane image according to depths predicted by monocular depth estimators, and then optimize the multiplane image under the supervision of pseudo multi-view data generated by a depth-aware warping and inpainting module. Both qualitative and quantitative experiments have been conducted to validate the superiority of our method to the state of the art. Our code and data are available at https://github.com/TrickyGo/SinMPI.
- Conference Article
1381
- 10.1109/cvpr46437.2021.00455
- Jun 1, 2021
We propose pixelNeRF, a learning framework that predicts a continuous neural scene representation conditioned on one or few input images. The existing approach for constructing neural radiance fields [27] involves optimizing the representation to every scene independently, requiring many calibrated views and significant compute time. We take a step towards resolving these shortcomings by introducing an architecture that conditions a NeRF on image inputs in a fully convolutional manner. This allows the network to be trained across multiple scenes to learn a scene prior, enabling it to perform novel view synthesis in a feed-forward manner from a sparse set of views (as few as one). Leveraging the volume rendering approach of NeRF, our model can be trained directly from images with no explicit 3D supervision. We conduct extensive experiments on ShapeNet benchmarks for single image novel view synthesis tasks with held-out objects as well as entire unseen categories. We further demonstrate the flexibility of pixelNeRF by demonstrating it on multi-object ShapeNet scenes and real scenes from the DTU dataset. In all cases, pixelNeRF outperforms current state-of-the-art baselines for novel view synthesis and single image 3D reconstruction. For the video and code, please visit the project website: https://alexyu.net/pixelnerf.
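The volume rendering that NeRF-style models rely on can be sketched as alpha compositing of samples along a ray. The snippet below is a generic illustration of the standard quadrature (per-sample opacity 1 - exp(-sigma * delta), weighted by accumulated transmittance), not code from pixelNeRF; the function name `composite_ray` and the sample values are assumptions.

```python
import math

def composite_ray(densities, colors, deltas):
    """Composite per-sample densities and RGB colors along one ray.

    densities: non-negative volume densities sigma_i
    colors:    RGB triples c_i
    deltas:    distances between adjacent samples
    Returns the rendered RGB color and the per-sample weights.
    """
    transmittance = 1.0  # fraction of light that reaches the current sample
    rgb = [0.0, 0.0, 0.0]
    weights = []
    for sigma, color, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha
        weights.append(weight)
        for k in range(3):
            rgb[k] += weight * color[k]
        transmittance *= 1.0 - alpha
    return tuple(rgb), weights

# Assumed toy ray: empty space, then a nearly opaque red surface, then green behind it.
rendered, ray_weights = composite_ray(
    densities=[0.0, 1e9, 5.0],
    colors=[(0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    deltas=[0.1, 0.1, 0.1],
)
```

Every operation here is differentiable with respect to the densities and colors, which is why such a model can be trained from images alone, without explicit 3D supervision.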
- Dissertation
- 10.32657/10356/200242
- Jan 1, 2024
3D scene representation and rendering have been pivotal tasks in 3D computer vision and computer graphics, essential for various applications such as virtual reality, augmented reality, and autonomous driving. As leading radiance field methods, neural radiance fields (NeRF) and 3D Gaussian splatting (3DGS) have recently achieved high-quality 3D scene representations by using MLPs and 3D primitives, respectively. In addition, they also achieve state-of-the-art scene rendering for novel view synthesis based on volume rendering and rasterization, respectively. Despite the significant progress in 3D scene representation and rendering, NeRF and 3DGS still face many challenges. First, proper NeRF training and high-quality scene representation and rendering depend on either reasonable camera pose initialization or manually-crafted camera pose distributions which are often unavailable, or hard to acquire in various real-world data. While Structure-from-Motion is frequently adopted to pre-compute camera poses, it is time-consuming and lacks differentiability which impedes the research and development of NeRF-based methods. The second is the domain gap issue in pose-free NeRF. One typical pipeline of pose-free NeRFs first trains a pose estimator with rendered images and then performs joint optimization of NeRF model and camera poses of real images predicted by the pose estimator. However, it relies solely on rendered images to train camera pose estimator, which often leads to biased and inaccurate camera pose estimation due to the domain gap between rendered and real images. This discrepancy can further result in local minima in the joint optimization of camera pose and NeRF scene representations. Third, 3DGS often suffers from an over-reconstruction issue during Gaussian densification, leading to suboptimal 3D scene representations and undesirable scene rendering with artifacts and blurred details. 
Fourth, 3DGS often comes with a large model size due to a large number of parameterized primitives required for explicit scene representations. While anchor-based 3DGS reduces 3D Gaussian redundancy, it often encounters the dilemma among anchor feature dimensions, model size and rendering quality. Large anchor feature size facilitates high-quality rendering but increases the model size due to numerous anchor points used in scene representation, whereas reducing feature size hinders accurate Gaussian prediction and leads to artifacts in rendered textures and structures. One significant challenge is thus to achieve high-quality scene representation and rendering with compact model size. In this thesis, we propose several innovative NeRF and 3DGS techniques that address the above issues successfully with superior 3D scene representation and rendering. First, we design a view matching NeRF (VMRF) that achieves superior NeRF representations without priors on camera poses or hand-crafted camera pose distributions. By leveraging unbalanced optimal transport, VMRF establishes feature correspondences between cross-view images to estimate relative camera poses, effectively mitigating reliance on prior pose information and distributions. Second, we propose IR-NeRF, a scene codebook-based implicit pose regularization framework for pose-free NeRF. IR-NeRF first constructs a scene codebook from unposed real images to store scene features and capture the scene-specific camera pose distribution implicitly as priors. It then employs the scene priors as regularization for promoting the robustness of camera pose estimation for real images and further improving the joint optimization of NeRF and camera poses. Third, we propose FreGS, an innovative 3D Gaussian splatting technique that addresses the over-reconstruction issue from frequency space. 
FreGS introduces a novel frequency annealing technique to achieve progressive frequency regularization, enabling coarse-to-fine Gaussian densification. It effectively improves the Gaussian densification, resulting in superior 3DGS-based scene representations and rendering for novel view synthesis. Fourth, we design SOGS, an advanced 3D Gaussian splatting technique that introduces second-order anchors to achieve superior rendering quality with reduced model size simultaneously. SOGS incorporates covariance-based second-order statistics to perform anchor feature augmentation, compensating for the reduced model size and improving the scene representation and rendering quality effectively. Overall, extensive experiments demonstrate that the proposed NeRF-based and 3DGS-based methods have effectively addressed or mitigated the aforementioned issues and achieved superior 3D scene representation and rendering.
- Conference Article
- 10.5121/csit.2023.131302
- Jul 29, 2023
Novel view synthesis is a long-standing problem that revolves around rendering frames of scenes from novel camera viewpoints. Volumetric approaches provide a solution for modeling occlusions through the explicit 3D representation of the camera frustum. Multi-plane Images (MPI) are volumetric methods that represent the scene using front-parallel planes at distinct depths but suffer from depth discretization, leading to a 2.5D scene representation. Another line of approach relies on implicit 3D scene representations. Neural Radiance Fields (NeRF) utilize neural networks to encapsulate the continuous 3D scene structure within the network weights, achieving photorealistic synthesis results; however, these methods are constrained to per-scene optimization settings, which are inefficient in practice. Multi-plane Neural Radiance Fields (MINE) open the door for combining implicit and explicit scene representations. It enables continuous 3D scene representations, especially in the depth dimension, while utilizing the input image features to avoid per-scene optimization. The main drawback of the current literature in this domain is being constrained to single-view input, limiting the synthesis ability to narrow viewpoint ranges. In this work, we thoroughly examine the performance, generalization, and efficiency of single-view multi-plane neural radiance fields. In addition, we propose a new multiplane NeRF architecture that accepts multiple views to improve the synthesis results and expand the viewing range. Features from the input source frames are effectively fused through a proposed attention-aware fusion module to highlight important information from different viewpoints. Experiments show the effectiveness of attention-based fusion and the promising outcomes of our proposed method when compared to multi-view NeRF and MPI techniques.
- Research Article
- 10.1109/tcsvt.2025.3643728
- Jan 1, 2025
- IEEE Transactions on Circuits and Systems for Video Technology
Recent advances in single-view 3D scene reconstruction have highlighted the challenges in capturing fine geometric details and ensuring structural consistency, particularly in high-fidelity outdoor scene modeling. This paper presents Niagara, a new single-view 3D scene reconstruction framework that can faithfully reconstruct challenging outdoor scenes from a single input image for the first time. Our approach integrates monocular depth and normal estimation as input, which substantially improves its ability to capture fine details, mitigating common issues like geometric detail loss and deformation. Additionally, we introduce a geometric affine field (GAF) and 3D self-attention as geometry-constraint, which combines the structural properties of explicit geometry with the adaptability of implicit feature fields, striking a balance between efficient rendering and high-fidelity reconstruction. Our framework finally proposes a specialized encoder-decoder architecture, where a depth-based 3D Gaussian decoder is proposed to predict 3D Gaussian parameters, which can be used for novel view synthesis. Extensive results and analyses suggest that our Niagara surpasses prior SoTA approaches such as Flash3D in both single-view and dual-view settings, significantly enhancing the geometric accuracy and visual fidelity, especially in outdoor scenes. Webpage: https://ai-kunkun.github.io/Niagara page/.
- Research Article
1
- 10.1007/s11548-024-03261-5
- Sep 13, 2024
- International Journal of Computer Assisted Radiology and Surgery
Purpose: RGB-D cameras in the operating room (OR) provide synchronized views of complex surgical scenes. Assimilation of this multi-view data into a unified representation allows for downstream tasks such as object detection and tracking, pose estimation, and action recognition. Neural radiance fields (NeRFs) can provide continuous representations of complex scenes with limited memory footprint. However, existing NeRF methods perform poorly in real-world OR settings, where a small set of cameras capture the room from entirely different vantage points. In this work, we propose NeRF-OR, a method for 3D reconstruction of dynamic surgical scenes in the OR. Methods: Where other methods for sparse-view datasets use either time-of-flight sensor depth or dense depth estimated from color images, NeRF-OR uses a combination of both. The depth estimations mitigate the missing values that occur in sensor depth images due to reflective materials and object boundaries. We propose to supervise with surface normals calculated from the estimated depths, because these are largely scale invariant. Results: We fit NeRF-OR to static surgical scenes in the 4D-OR dataset and show that its representations are geometrically accurate, where state of the art collapses to sub-optimal solutions. Compared to earlier work, NeRF-OR grasps fine scene details while training 30× faster. Additionally, NeRF-OR can capture whole-surgery videos while synthesizing views at intermediate time values with an average PSNR of 24.86 dB. Last, we find that our approach has merit in sparse-view settings beyond those in the OR, by benchmarking on the NVS-RGBD dataset that contains as few as three training views. NeRF-OR synthesizes images with a PSNR of 26.72 dB, a 1.7% improvement over state of the art. Conclusion: Our results show that NeRF-OR allows for novel view synthesis with videos captured by a small number of cameras with entirely different vantage points, which is the typical camera setting in the OR. 
Code is available via: github.com/Beerend/NeRF-OR.
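The PSNR values quoted above follow the usual definition, 10 * log10(MAX^2 / MSE). The helper below is a generic sketch of that metric on flat lists of pixel intensities, not the evaluation code of NeRF-OR; the name `psnr` and the assumption that intensities lie in [0, 1] are illustrative.

```python
import math

def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# Assumed toy example: a uniform error of 0.1 on a black reference image.
score = psnr([0.0, 0.0, 0.0, 0.0], [0.1, 0.1, 0.1, 0.1])
```

Since PSNR is logarithmic in the mean squared error, a gain of 3 dB corresponds to roughly halving the error.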
- Conference Article
140
- 10.1109/iccv48922.2021.01235
- Oct 1, 2021
In this paper, we propose MINE to perform novel view synthesis and depth estimation via dense 3D reconstruction from a single image. Our approach is a continuous depth generalization of the Multiplane Images (MPI) by introducing the NEural radiance fields (NeRF). Given a single image as input, MINE predicts a 4-channel image (RGB and volume density) at arbitrary depth values to jointly reconstruct the camera frustum and fill in occluded contents. The reconstructed and inpainted frustum can then be easily rendered into novel RGB or depth views using differentiable rendering. Extensive experiments on RealEstate10K, KITTI and Flowers Light Fields show that our MINE outperforms state-of-the-art by a large margin in novel view synthesis. We also achieve competitive results in depth estimation on iBims-1 and NYU-v2 without annotated depth supervision. Our source code is available at https://github.com/vincentfung13/MINE.
- Book Chapter
2
- 10.1007/978-3-030-69538-5_42
- Jan 1, 2021
This paper addresses the problem of novel view synthesis by means of neural rendering, where we are interested in predicting the novel view at an arbitrary camera pose based on a given set of input images from other viewpoints. Using the known query pose and input poses, we create an ordered set of observations that leads to the target view. Thus, the problem of single novel view synthesis is reformulated as a sequential view prediction task. In this paper, the proposed Transformer-based Generative Query Network (T-GQN) extends the neural-rendering methods by adding two new concepts. First, we use multi-view attention learning between context images to obtain multiple implicit scene representations. Second, we introduce a sequential rendering decoder to predict an image sequence, including the target view, based on the learned representations. Finally, we evaluate our model on various challenging datasets and demonstrate that our model not only gives consistent predictions but also doesn’t require any retraining for finetuning.
- Conference Article
105
- 10.1109/wacv51458.2022.00029
- Jan 1, 2022
In this work, we aim to address the 3D scene stylization problem - generating stylized images of the scene at arbitrary novel view angles. A straightforward solution is to combine existing novel view synthesis and image/video style transfer approaches, which often leads to blurry results or inconsistent appearance. Inspired by the high-quality results of the neural radiance fields (NeRF) method, we propose a joint framework to directly render novel views with the desired style. Our framework consists of two components: an implicit representation of the 3D scene with the neural radiance fields model, and a hypernetwork to transfer the style information into the scene representation. To alleviate the training difficulties and memory burden, we propose a two-stage training procedure and a patch sub-sampling approach to optimize the style and content losses with the neural radiance fields model. After optimization, our model is able to render consistent novel views at arbitrary view angles with arbitrary style. Both quantitative evaluation and human subject study have demonstrated that the proposed method generates faithful stylization results with consistent appearance across different views.
- Book Chapter
1
- 10.1007/978-3-031-20497-5_17
- Jan 1, 2022
We present CDNeRF, a simple yet powerful learning framework that performs novel view synthesis by reconstructing neural radiance fields from a single-view RGB image. Novel view synthesis by neural radiance fields has achieved great improvement with the development of deep learning. However, how to make the method generic across scenes has always been a challenging task. A good idea is to introduce 2D image features as prior knowledge for adaptive modeling, yet RGB features (C) lack geometry and 3D spatial information. To compensate, we introduce depth features into the model. Our method uses a variant depth estimation network to extract depth features (D) without the need for additional input. In addition, we also introduce the transformer module to effectively fuse the multi-modal features of RGB and depth. Extensive experiments are carried out on two category-specific benchmarks (i.e., Chair, Car) and two category-agnostic benchmarks (i.e., ShapeNet, DTU). The results demonstrate that our CDNeRF outperforms the previous methods, and achieves state-of-the-art neural rendering performance.
- Conference Article
5
- 10.1109/iros.2004.1390008
- Sep 28, 2004
In this paper, we compare two major structures of MSD (mass-spring-damper) particle models. One is the lattice (hexahedral) structure, and the other is the truss (tetrahedral) structure. Both (especially the truss structure) have frequently been used to represent elastic and/or visco-elastic objects. The MSD model efficiently calculates shape deformation of the above materials. In addition, in order to maintain shape precision of each deformation, we carefully calibrate the damper and spring coefficients of the Voigt part and the damper coefficient of the other part in the basic MSD element, using many surface points captured from a real rheological object. A genetic algorithm is used for probabilistic calibration. From the comparison, we obtain the following properties: (1) The lattice structure has too many elements for calculating force propagation. Therefore, it yields precise shape deformation with the help of the local (feedforward) volume constant condition. (2) The truss structure does not have enough elements for propagating internal forces; therefore, in order to keep a reasonable volume by expanding its virtual rheological object, we need the global (feedback) volume constant condition. (3) The global condition is time consuming, but can directly control the total volume of the virtual rheological object. On the other hand, the local one is quick, but directly expands only a part (voxel) of the virtual object. Therefore, the volume and shape in the lattice structure with the local condition are better than those in the truss structure including the global one. (4) The number of MSD elements in the lattice structure is about two times larger than that in the truss one. Therefore, the former calculation is about two times slower than the latter one. In contrast, the global volume constant condition is at least two times slower than the local one. 
As a result, the calculation time of the lattice structure with the local condition is shorter than that of the truss structure with the global one. In conclusion, the lattice structure with the local volume constant condition is the best in terms of calculation cost and shape precision.
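To make the Voigt part of an MSD element concrete, the sketch below integrates a single point mass attached to a spring and damper in parallel. It is only an illustration under simplifying assumptions: one degree of freedom, semi-implicit Euler time stepping, hand-picked coefficients, and no series damper, lattice or truss connectivity, or volume constant condition from the paper. The name `simulate_voigt` is hypothetical.

```python
def simulate_voigt(mass, stiffness, damping, x0, v0, dt, steps):
    """Integrate m * x'' = -k * x - c * x' with semi-implicit Euler.

    Returns the list of displacements, including the initial one.
    """
    x, v = x0, v0
    trajectory = [x]
    for _ in range(steps):
        # Spring and damper act in parallel, so their forces add.
        force = -stiffness * x - damping * v
        v += dt * force / mass  # update velocity first
        x += dt * v             # then position, using the new velocity
        trajectory.append(x)
    return trajectory

# Assumed coefficients: unit mass released from displacement 1.0 at rest.
trajectory = simulate_voigt(mass=1.0, stiffness=10.0, damping=1.0,
                            x0=1.0, v0=0.0, dt=0.01, steps=2000)
```

Calibration as described in the abstract would amount to searching for `stiffness` and `damping` values whose simulated trajectory best matches the measured surface points.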
- Single Report
1
- 10.15760/etd.7294
- Feb 1, 2020
Novel view synthesis is a classic problem in computer vision. It refers to the generation of previously unseen views of a scene from a set of sparse input images taken from different viewpoints. One example of novel view synthesis is the interpolation of views in between the two images of a stereo camera. Another classic problem in computer vision is video frame interpolation, which is important for video processing. It refers to the generation of video frames in between existing ones and is commonly used to increase the frame rate of a video or to match the frame rate to the refresh rate of the monitor that the video is being displayed on. Interestingly, off-the-shelf video frame interpolation can directly be employed to successfully perform view interpolation to address the aforementioned stereo view interpolation problem. Video frame interpolation can be seen as temporal novel view synthesis. However, this perspective is usually not considered and novel view synthesis generally concerns generating unseen views in space rather than time. For this reason, the set of sparse input images that is used for spatial novel view synthesis is commonly either captured at the same time, or it is assumed that the scene is static. This paradigm limits the applicability of novel view synthesis in real-world scenarios though. This thesis addresses three applications of novel view synthesis and provides practical solutions that do not require difficult-to-acquire multi-view imagery: video frame interpolation which performs temporal video-to-video synthesis, synthesizing the 3D Ken Burns effect from a single image which performs spatial image-to-video synthesis, synthesizing video action shots which performs spatiotemporal video-to-video and video-to-image synthesis. These applications not only explore different dimensions of time and space, they also perform novel view synthesis on everyday image and video footage. 
This is in stark contrast to the large body of existing work which focuses on spatial novel view synthesis while requiring multiple input views that were either captured at the same time or under the assumption of a static scene.
- Conference Article
5
- 10.24963/ijcai.2024/203
- Aug 1, 2024
Neural fields are now the central focus of research in 3D vision and computer graphics. Existing methods mainly focus on various scene representations, such as neural points and 3D Gaussians. However, few works have studied the rendering process to enhance the neural fields. In this work, we propose a plug-in method named K-Buffers that leverages multiple buffers to improve the rendering performance. Our method first renders K buffers from scene representations and constructs K pixel-wise feature maps. Then, we introduce a K-Feature Fusion Network (KFN) to merge the K pixel-wise feature maps. Finally, we adopt a feature decoder to generate the rendering image. We also introduce an acceleration strategy to improve rendering speed and quality. We apply our method to well-known radiance field baselines, including neural point fields and 3D Gaussian Splatting (3DGS). Extensive experiments demonstrate that our method effectively enhances the rendering performance of neural point fields and 3DGS.
- Conference Article
56
- 10.1109/wacv56688.2023.00432
- Jan 1, 2023
We present Control-NeRF, a method for performing flexible, 3D-aware image content manipulation while enabling high-quality novel view synthesis, from a set of posed input images. NeRF-based approaches [23] are effective for novel view synthesis, however such models memorize the radiance for every point in a scene within a neural network. Since these models are scene-specific and lack a 3D scene representation, classical editing such as shape manipulation, or combining scenes is not possible. While there are some recent hybrid approaches that combine NeRF with external scene representations such as sparse voxels, planes, hash tables, etc. [16], [5], [24], [9], they focus mostly on efficiency and don't explore the scene editing and manipulation capabilities of hybrid approaches. With the aim of exploring controllable scene representations for novel view synthesis, our model couples learnt scene-specific 3D feature volumes with a general NeRF rendering network. We can generalize to novel scenes by optimizing only the scene-specific 3D feature volume, while keeping the parameters of the rendering network fixed. Since the feature volumes are independent of the rendering model, we can manipulate and combine scenes by editing their corresponding feature volumes. The edited volume can then be plugged into the rendering model to synthesize high-quality novel views. We demonstrate scene manipulations including: scene mixing; applying rigid and non-rigid transformations; inserting, moving and deleting objects in a scene; while producing photo-realistic novel-view synthesis results.