Dynamic street scene 3D reconstruction with self-supervised Gaussian Splatting using spatiotemporal deformation field
ABSTRACT: Accurate 3D reconstruction of dynamic street scenes is crucial for autonomous driving, yet existing methods either require costly 3D annotation boxes or fail to capture fine-grained object motion. To overcome these limitations, we propose SSTD-GS, a self-supervised Gaussian Splatting framework for annotation-free dynamic scene reconstruction and novel view synthesis. Specifically, we design a spatiotemporal deformation field to model the detailed motion of dynamic objects, and develop an uncertainty-based dynamic-mask-guided self-supervised strategy that enables joint optimization of dynamic and static scene components. To further improve novel view synthesis quality, we exploit the strong priors of a depth completion model and a diffusion model, designing a confidence-based dense depth prior module and a diffusion-based virtual-view prior module that provide additional geometric and appearance constraints. Moreover, a geometry-aware Gaussian adaptive control mechanism suppresses the inaccurate densification in 3DGS caused by rendering errors. Experimental results on the Waymo and KITTI datasets show that SSTD-GS outperforms existing NeRF- and 3DGS-based methods in 4D scene reconstruction and novel view synthesis. In the novel view synthesis task, PSNR reaches 29.83 dB and 28.59 dB on the two datasets, which is 1.72 dB and 1.36 dB higher than the second-best method, PVG.
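For intuition, here is a minimal sketch of the deformation-field idea named in the abstract: an MLP that maps a canonical Gaussian center plus a timestamp to a position offset. The network width, the raw (un-encoded) inputs, and the omission of rotation/scale deformation are assumptions for illustration, not the authors' architecture.

```python
# Minimal sketch (not the authors' code): a spatiotemporal deformation
# field that predicts per-Gaussian position offsets from (center, time).
import torch
import torch.nn as nn

class DeformationField(nn.Module):
    def __init__(self, hidden: int = 128):
        super().__init__()
        # Input: 3D Gaussian center + scalar timestamp; output: 3D offset.
        self.mlp = nn.Sequential(
            nn.Linear(4, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, means: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # means: (N, 3) canonical Gaussian centers; t: scalar time tensor.
        t_col = t.expand(means.shape[0], 1)
        offsets = self.mlp(torch.cat([means, t_col], dim=-1))
        return means + offsets  # deformed centers at time t

field = DeformationField()
means = torch.randn(1024, 3)                 # canonical Gaussian centers
deformed = field(means, torch.tensor(0.5))   # centers at t = 0.5
```

Published deformation-field methods typically also predict rotation and scale changes and apply a positional encoding to the inputs; the sketch keeps only the core mapping.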
- Research Article
- 10.1371/journal.pone.0055586
- May 7, 2013
- PLoS ONE
Remote dynamic three-dimensional (3D) scene reconstruction renders the motion structure of a 3D scene remotely by means of both the color video and the corresponding depth maps. It has shown great potential for telepresence applications such as remote monitoring and remote medical imaging. In this setting, video rate and high resolution are two crucial characteristics of a good depth map, yet they conflict with each other during depth-sensor capture. Therefore, recent works prefer to transmit only the high-resolution color video to the terminal side, where the scene depth is then reconstructed by estimating motion vectors from the video, typically using propagation-based methods to achieve video-rate depth reconstruction. However, in most remote transmission systems, only the compressed color video stream is available. As a result, the color video restored from the stream suffers quality losses, and the extracted motion vectors are therefore too inaccurate for depth reconstruction. In this paper, we propose a precise and robust scheme for dynamic 3D scene reconstruction using the compressed color video stream and its inaccurate motion vectors. Our method rectifies the inaccurate motion vectors by analyzing and compensating for their quality losses, the absence of motion vectors in spatial prediction, and dislocation in near-boundary regions. This rectification ensures that depth maps can be recovered at both video rate and high resolution on the terminal side, reducing system cost for both compression and transmission. Our experiments validate that the proposed scheme is robust for depth-map and dynamic scene reconstruction over long propagation distances, even at high compression ratios, outperforming benchmark approaches by at least 3.3950 dB in quality for remote applications.
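As a concrete illustration of the propagation step that inaccurate motion vectors corrupt, the following sketch warps a previous depth map block by block along integer motion vectors; the block size and vector convention are assumptions, not the paper's scheme.

```python
# Minimal sketch (assumptions, not the paper's implementation): propagate a
# depth map to the next frame using per-block integer motion vectors, the
# basic operation that compressed-stream vector errors would corrupt.
import numpy as np

def propagate_depth(prev_depth: np.ndarray, mv: np.ndarray, block: int = 16):
    """prev_depth: (H, W) depth map; mv: (H//block, W//block, 2) integer
    motion vectors (dy, dx) pointing from each current block into the
    previous frame."""
    H, W = prev_depth.shape
    cur_depth = np.zeros_like(prev_depth)
    for by in range(0, H, block):
        for bx in range(0, W, block):
            dy, dx = mv[by // block, bx // block]
            # Clamp the source block so it stays inside the previous frame.
            sy = np.clip(by + dy, 0, H - block)
            sx = np.clip(bx + dx, 0, W - block)
            cur_depth[by:by + block, bx:bx + block] = \
                prev_depth[sy:sy + block, sx:sx + block]
    return cur_depth
```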
- Conference Article
- 10.1109/iccv.2015.109
- Dec 1, 2015
This paper introduces a general approach to dynamic scene reconstruction from multiple moving cameras without prior knowledge or limiting constraints on the scene structure, appearance, or illumination. Existing techniques for dynamic scene reconstruction from multiple wide-baseline camera views primarily focus on accurate reconstruction in controlled environments, where the cameras are fixed and calibrated and the background is known. These approaches are not robust for general dynamic scenes captured with sparse moving cameras. Previous approaches for outdoor dynamic scene reconstruction assume prior knowledge of the static background appearance and structure. The primary contributions of this paper are twofold: an automatic method for initial coarse dynamic scene segmentation and reconstruction without prior knowledge of background appearance or structure, and a general robust approach for joint segmentation refinement and dense reconstruction of dynamic scenes from multiple wide-baseline static or moving cameras. Evaluation is performed on a variety of indoor and outdoor scenes with cluttered backgrounds and multiple dynamic non-rigid objects such as people. Comparison with state-of-the-art approaches demonstrates improved accuracy in both multiple-view segmentation and dense reconstruction. The proposed approach also eliminates the requirement for prior knowledge of scene structure and appearance.
- Research Article
- 10.5194/isprs-archives-xlviii-g-2025-649-2025
- Jul 28, 2025
- The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences
Abstract. Reconstructing dynamic urban scenes from unmanned aerial vehicle (UAV) full-motion videos is a vital task with significant applications in urban planning, traffic analysis, and autonomous navigation. However, modeling these scenes is challenging due to their large scale and, more importantly, the ever-changing presence of dynamic objects such as vehicles and pedestrians. In recent years, emerging neural 3D scene representation approaches have gained popularity for their promising performance in novel view synthesis, and several recent works have further explored the potential of modeling large-scale and dynamic scenes. While most existing methods focus on indoor or street-level scenes, very little effort has been made to address the unique complexities of dynamic urban environments captured by UAVs. To investigate this problem, we apply a recently developed dynamic 3D Gaussian Splatting framework that decomposes urban scenes into static and dynamic elements, thereby achieving efficient and accurate modeling. We further reduce the need for auxiliary input data, accommodating more general cases in which only video sequences are available. Specifically, we propose a pipeline for automatically tracking dynamic vehicles using trajectory optimization to model their natural movement, thereby eliminating the dependency on prior knowledge of vehicles, which is often unavailable in real-life scenarios. By integrating the dynamic 3D Gaussian Splatting framework with the photogrammetric reconstruction pipeline, our approach offers scalable and reliable dynamic 3D scene reconstruction. It is evaluated on multiple UAV datasets, and the results demonstrate the promising quality of scene reconstruction and view synthesis.
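A hypothetical sketch of trajectory optimization in this spirit: per-frame vehicle centers are fit to noisy detections under a constant-velocity (acceleration-penalty) smoothness prior. The data source, loss weights, and parameterization are illustrative assumptions, not the paper's method.

```python
# Hypothetical sketch: optimize a vehicle trajectory against noisy
# per-frame detections with a smoothness prior on acceleration.
import torch

detections = torch.randn(50, 3)                  # noisy per-frame vehicle centers
traj = detections.clone().requires_grad_(True)   # trajectory to optimize
opt = torch.optim.Adam([traj], lr=1e-2)

for _ in range(500):
    opt.zero_grad()
    # Data term: stay close to the observed detections.
    data_term = ((traj - detections) ** 2).sum(-1).mean()
    # Smoothness term: penalize second differences (acceleration),
    # encouraging natural, near-constant-velocity vehicle motion.
    accel = traj[2:] - 2 * traj[1:-1] + traj[:-2]
    smooth_term = (accel ** 2).sum(-1).mean()
    loss = data_term + 10.0 * smooth_term        # weight is an assumption
    loss.backward()
    opt.step()
```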
- Conference Article
- 10.1109/cvpr.2017.592
- Jul 1, 2017
In this paper we propose a framework for spatially and temporally coherent semantic co-segmentation and reconstruction of complex dynamic scenes from multiple static or moving cameras. Semantic co-segmentation exploits the coherence in semantic class labels both spatially, between views at a single time instant, and temporally, between widely spaced time instants of dynamic objects with similar shape and appearance. We demonstrate that semantic coherence results in improved segmentation and reconstruction for complex scenes. A joint formulation is proposed for semantically coherent object-based co-segmentation and reconstruction of scenes by enforcing consistent semantic labelling between views and over time. Semantic tracklets are introduced to enforce temporal coherence in semantic labelling and reconstruction between widely spaced instances of dynamic objects. Tracklets of dynamic objects enable unsupervised learning of appearance and shape priors that are exploited in joint segmentation and reconstruction. Evaluation on challenging indoor and outdoor sequences with hand-held moving cameras shows improved accuracy in segmentation, temporally coherent semantic labelling and 3D reconstruction of dynamic scenes.
- Research Article
- 10.1609/aaai.v39i9.33045
- Apr 11, 2025
- Proceedings of the AAAI Conference on Artificial Intelligence
While Neural Radiance Fields (NeRFs) have advanced the frontiers of novel view synthesis (NVS) using LiDAR data, they still struggle in dynamic scenes. Due to the low frequency and sparsity characteristics of LiDAR point clouds, it is challenging to spontaneously learn a dynamic and consistent scene representation from posed scans. In this paper, we propose STGC-NeRF, a novel LiDAR NeRF method that combines spatial-temporal geometry consistency to enhance the reconstruction of dynamic scenes. First, we propose a temporal geometry consistency regularization to enhance the regression of time-varying scene geometries from low-frequency LiDAR sequences. By estimating the pointwise correspondences between synthetic (or real) and real frames at different times, we convert them into various forms of temporal supervision. This alleviates the inconsistency caused by moving objects in dynamic scenes. Second, to improve the reconstruction of sparse LiDAR data, we propose spatial geometric consistency constraints. By computing multiple neighborhood feature descriptors incorporating geometric and contextual information, we capture structural geometry information from sparse LiDAR data. This helps encourage consistent direction, smoothness, and detail of the local surface. Extensive experiments on the KITTI-360 and nuScenes datasets demonstrate that STGC-NeRF outperforms state-of-the-art methods in both geometry and intensity accuracy for dynamic LiDAR scene reconstruction.
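As a rough illustration of how pointwise correspondences can become temporal supervision, the sketch below computes a one-sided chamfer distance between point sets at consecutive times; this is a generic consistency term, not STGC-NeRF's exact regularization.

```python
# Illustrative sketch (not STGC-NeRF's exact formulation): match each point
# in frame t to its nearest neighbor in frame t+1 and penalize the residual.
import torch

def temporal_consistency_loss(pts_t: torch.Tensor, pts_t1: torch.Tensor):
    # pts_t: (N, 3), pts_t1: (M, 3) LiDAR points at consecutive times.
    d = torch.cdist(pts_t, pts_t1)   # (N, M) pairwise distances
    nn_dist, _ = d.min(dim=1)        # distance to nearest neighbor in t+1
    return nn_dist.mean()            # one-sided chamfer distance

loss = temporal_consistency_loss(torch.randn(2048, 3), torch.randn(2048, 3))
```

In practice such a term would be applied only to estimated correspondences that pass outlier checks, since moving objects violate naive nearest-neighbor matching.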
- Research Article
- 10.3390/app15084190
- Apr 10, 2025
- Applied Sciences
This paper presents a novel 3D Gaussian Splatting (3DGS)-based Simultaneous Localization and Mapping (SLAM) system that integrates Light Detection and Ranging (LiDAR) and vision data to enhance dynamic scene tracking and reconstruction. Existing 3DGS systems face challenges in sensor fusion and handling dynamic objects. To address these, we introduce a hybrid uncertainty-based 3D segmentation method that leverages uncertainty estimation and 3D object detection, effectively removing dynamic points and improving static map reconstruction. Our system also employs a sliding window-based keyframe fusion strategy that reduces computational load while maintaining accuracy. By incorporating a novel dynamic rendering loss function and pruning techniques, we suppress artifacts such as ghosting and ensure real-time operation in complex environments. Extensive experiments show that our system outperforms existing methods in dynamic object removal and overall reconstruction quality. The key innovations of our work lie in its integration of hybrid uncertainty-based segmentation, dynamic rendering loss functions, and an optimized sliding window strategy, which collectively enhance robustness and efficiency in dynamic scene reconstruction. This approach offers a promising solution for real-time robotic applications, including autonomous navigation and augmented reality.
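The following is a minimal sketch of one plausible sliding-window keyframe strategy: insert a keyframe when the camera has translated or rotated beyond a threshold, and let a fixed-size window evict the oldest entry. The thresholds and data structure are assumptions, not the paper's implementation.

```python
# Minimal sketch (hypothetical thresholds): sliding-window keyframe selection.
from collections import deque
import numpy as np

class KeyframeWindow:
    def __init__(self, size=8, trans_thresh=0.3, rot_thresh=0.2):
        self.frames = deque(maxlen=size)  # oldest keyframe evicted automatically
        self.trans_thresh = trans_thresh  # meters
        self.rot_thresh = rot_thresh      # radians

    def maybe_insert(self, pose: np.ndarray, frame) -> bool:
        # pose: 4x4 camera-to-world matrix for the candidate frame.
        if self.frames:
            last_pose, _ = self.frames[-1]
            rel = np.linalg.inv(last_pose) @ pose
            trans = np.linalg.norm(rel[:3, 3])
            # Rotation angle from the trace of the relative rotation matrix.
            rot = np.arccos(np.clip((np.trace(rel[:3, :3]) - 1) / 2, -1, 1))
            if trans < self.trans_thresh and rot < self.rot_thresh:
                return False  # too similar to the last keyframe; skip
        self.frames.append((pose, frame))
        return True
```

Bounding the window size is what caps the per-iteration computational load while recent views still constrain the map.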
- Conference Article
- 10.1109/cvpr42600.2020.00538
- Jun 1, 2020
This paper presents a new method to synthesize an image from arbitrary views and times given a collection of images of a dynamic scene. A key challenge for novel view synthesis arises in dynamic scene reconstruction, where epipolar geometry does not apply to the local motion of dynamic contents. To address this challenge, we propose to combine the depth from single view (DSV) and the depth from multi-view stereo (DMV), where DSV is complete, i.e., a depth is assigned to every pixel, yet view-variant in its scale, while DMV is view-invariant yet incomplete. Our insight is that although its scale and quality are inconsistent with other views, the depth estimation from a single view can be used to reason about the globally coherent geometry of dynamic contents. We cast this problem as learning to correct the scale of DSV and to refine each depth with locally consistent motions between views to form a coherent depth estimation. We integrate these tasks into a depth fusion network in a self-supervised fashion. Given the fused depth maps, we synthesize a photorealistic virtual view at a specific location and time with our deep blending network, which completes the scene and renders the virtual view. We evaluate our method for depth estimation and view synthesis on diverse real-world dynamic scenes and show outstanding performance over existing methods.
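To make the scale-correction insight concrete, the sketch below fits a single least-squares scale aligning the complete DSV to the sparse but view-invariant DMV over their shared valid pixels. The paper learns a per-depth refinement network, so a global scale is only a simplified stand-in.

```python
# Sketch of the scale-correction idea (illustrative, not the paper's network):
# fit one scale s minimizing ||s * dsv - dmv||^2 over pixels where DMV exists.
import numpy as np

def align_dsv_to_dmv(dsv: np.ndarray, dmv: np.ndarray, valid: np.ndarray):
    # dsv, dmv: (H, W) depth maps; valid: (H, W) bool mask of DMV coverage.
    d, m = dsv[valid], dmv[valid]
    s = (d * m).sum() / (d * d).sum()   # closed-form least-squares scale
    return s * dsv                       # DSV rescaled to DMV's scale
```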
- Research Article
- 10.1016/j.cosrev.2020.100338
- Dec 5, 2020
- Computer Science Review
Real-time 3D reconstruction techniques applied in dynamic scenes: A systematic literature review
- Research Article
- 10.1016/j.imavis.2024.105304
- Oct 19, 2024
- Image and Vision Computing
A review of recent advances in 3D Gaussian Splatting for optimization and reconstruction
- Research Article
- 10.1088/1742-6596/1345/6/062037
- Nov 1, 2019
- Journal of Physics: Conference Series
In the process of reconstructing dynamic scenes, traditional 3D reconstruction introduces data interference and data-transition problems due to the continuity of the collected data. A three-dimensional dynamic scene reconstruction scheme based on virtual reality is proposed. Virtual reality is an emerging technology whose key idea is to model an actual object to obtain a virtual image and present it to people. Three-dimensional reconstruction is one of its core technologies, encompassing monocular vision, pattern recognition, support vector machine computation, and sensor technology.
- Research Article
- 10.1007/s11548-024-03261-5
- Sep 13, 2024
- International Journal of Computer Assisted Radiology and Surgery
Purpose: RGB-D cameras in the operating room (OR) provide synchronized views of complex surgical scenes. Assimilation of this multi-view data into a unified representation allows for downstream tasks such as object detection and tracking, pose estimation, and action recognition. Neural radiance fields (NeRFs) can provide continuous representations of complex scenes with a limited memory footprint. However, existing NeRF methods perform poorly in real-world OR settings, where a small set of cameras capture the room from entirely different vantage points. In this work, we propose NeRF-OR, a method for 3D reconstruction of dynamic surgical scenes in the OR.
Methods: Where other methods for sparse-view datasets use either time-of-flight sensor depth or dense depth estimated from color images, NeRF-OR uses a combination of both. The depth estimations mitigate the missing values that occur in sensor depth images due to reflective materials and object boundaries. We propose to supervise with surface normals calculated from the estimated depths, because these are largely scale invariant.
Results: We fit NeRF-OR to static surgical scenes in the 4D-OR dataset and show that its representations are geometrically accurate, where the state of the art collapses to sub-optimal solutions. Compared to earlier work, NeRF-OR grasps fine scene details while training 30× faster. Additionally, NeRF-OR can capture whole-surgery videos while synthesizing views at intermediate time values with an average PSNR of 24.86 dB. Last, we find that our approach has merit in sparse-view settings beyond those in the OR, by benchmarking on the NVS-RGBD dataset, which contains as few as three training views. NeRF-OR synthesizes images with a PSNR of 26.72 dB, a 1.7% improvement over the state of the art.
Conclusion: Our results show that NeRF-OR allows for novel view synthesis with videos captured by a small number of cameras with entirely different vantage points, which is the typical camera setting in the OR. Code is available via: github.com/Beerend/NeRF-OR.
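A sketch of how surface normals can be computed from an estimated depth map (an assumed formulation; the paper's exact procedure may differ): back-project pixels to camera space and take the normalized cross product of the local tangent vectors. A global rescaling of depth scales both tangents equally, so the normalized normals are unchanged, which is the scale invariance the abstract relies on.

```python
# Sketch (assumed form): surface normals from a depth map via the cross
# product of camera-space tangent vectors; invariant to global depth scale.
import torch
import torch.nn.functional as F

def normals_from_depth(depth: torch.Tensor, fx: float, fy: float):
    # depth: (H, W). Back-project pixels to camera-space 3D points.
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    x = (u - W / 2) * depth / fx
    y = (v - H / 2) * depth / fy
    pts = torch.stack([x, y, depth], dim=-1)        # (H, W, 3)
    du = pts[:, 1:, :] - pts[:, :-1, :]             # horizontal tangent
    dv = pts[1:, :, :] - pts[:-1, :, :]             # vertical tangent
    n = torch.cross(du[:-1], dv[:, :-1], dim=-1)    # (H-1, W-1, 3)
    return F.normalize(n, dim=-1)                   # unit normals
```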
- Research Article
- 10.1007/s11263-019-01241-w
- Oct 3, 2019
- International Journal of Computer Vision
Simultaneous semantically coherent, object-based, long-term 4D scene flow estimation, co-segmentation and reconstruction is proposed, exploiting the coherence in semantic class labels both spatially, between views at a single time instant, and temporally, between widely spaced time instants of dynamic objects with similar shape and appearance. In this paper we propose a framework for spatially and temporally coherent semantic 4D scene flow of general dynamic scenes from multiple view videos captured with a network of static or moving cameras. Semantic coherence results in improved 4D scene flow estimation, segmentation and reconstruction for complex dynamic scenes. Semantic tracklets are introduced to robustly initialize the scene flow in the joint estimation and enforce temporal coherence in 4D flow, semantic labelling and reconstruction between widely spaced instances of dynamic objects. Tracklets of dynamic objects enable unsupervised learning of long-term flow, appearance and shape priors that are exploited in semantically coherent 4D scene flow estimation, co-segmentation and reconstruction. Comprehensive performance evaluation against state-of-the-art techniques on challenging indoor and outdoor sequences with hand-held moving cameras shows improved accuracy in 4D scene flow, segmentation, temporally coherent semantic labelling, and reconstruction of dynamic scenes.
- Conference Article
- 10.1109/itnt55410.2022.9848739
- May 23, 2022
The task of reconstructing intermediate frames of a video is to increase the frame rate by synthesizing new frames from information in neighboring frames. A higher frame rate improves the quality of the visual experience, producing smoother transitions and reduced motion blur under large-scale motion. Intermediate-frame reconstruction methods are also widely used in 3D reconstruction of dynamic objects and scenes. This paper presents a performance comparison of the state-of-the-art intermediate-frame reconstruction methods XVFI, RRIN, CDFI, RIFE, and AdaCof on videos of dynamic scenes.
- Research Article
- 10.3390/electronics14122347
- Jun 8, 2025
- Electronics
The 3DGS (3D Gaussian Splatting) series of works has achieved significant success in novel view synthesis, but further research is needed for dynamic scene reconstruction tasks. In this paper, we propose a new framework based on 3DGS for handling dynamic scene reconstruction problems involving color changes. Our approach employs a multi-stage training strategy combining motion and color deformation fields to accurately model dynamic geometry and appearance changes. Additionally, we design two modular components: the Dynamic Component for capturing motion variations and the Color Component for managing material and color changes. These components flexibly adapt to different scenes, enhancing our method’s versatility. Experimental results demonstrate that our method achieves real-time rendering at 80 FPS on an RTX 4090 and achieves higher reconstruction accuracy than baseline methods such as HexPlane and Deformable3DGS. Furthermore, it reduces training time by approximately 10%, indicating improved training efficiency. These quantitative results confirm the effectiveness of our approach in delivering high-fidelity 4D reconstruction of complex dynamic environments.
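A hypothetical sketch of the two modular components and the multi-stage schedule described above: small MLPs for motion and color offsets, enabled stage by stage. Layer sizes, inputs, and step counts are illustrative assumptions, not the paper's configuration.

```python
# Hypothetical sketch: a Dynamic Component for motion offsets and a Color
# Component for color offsets, trained in stages (static, then motion,
# then color). All sizes and step counts are assumptions.
import torch
import torch.nn as nn

def mlp(out_dim):
    return nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, out_dim))

dynamic_comp = mlp(3)   # (x, y, z, t) -> position offset
color_comp = mlp(3)     # (x, y, z, t) -> RGB offset

def deform(means, colors, t):
    xt = torch.cat([means, t.expand(means.shape[0], 1)], dim=-1)
    return means + dynamic_comp(xt), colors + color_comp(xt)

means, colors = torch.randn(256, 3), torch.rand(256, 3)
new_means, new_colors = deform(means, colors, torch.tensor(0.3))

# Multi-stage schedule: only the listed components receive gradients.
stages = [
    {"train": [], "steps": 3000},                          # static scene only
    {"train": [dynamic_comp], "steps": 5000},              # add motion field
    {"train": [dynamic_comp, color_comp], "steps": 5000},  # add color field
]
```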
- Book Chapter
- 10.1007/978-3-030-28603-3_5
- Jan 1, 2019
A key task in computer vision is that of generating virtual 3D models of real-world scenes by reconstructing the shape, appearance and, in the case of dynamic scenes, motion of the scene from visual sensors. Recently, low-cost video plus depth (RGB-D) sensors have become widely available and have been applied to 3D reconstruction of both static and dynamic scenes. RGB-D sensors contain an active depth sensor, which provides a stream of depth maps alongside standard colour video. The low cost and ease of use of RGB-D devices as well as their video rate capture of images along with depth make them well suited to 3D reconstruction. Use of active depth capture overcomes some of the limitations of passive monocular or multiple-view video-based approaches since reliable, metrically accurate estimates of the scene depth at each pixel can be obtained from a single view, even in scenes that lack distinctive texture. There are two key components to 3D reconstruction from RGB-D data: (1) spatial alignment of the surface over time and, (2) fusion of noisy, partial surface measurements into a more complete, consistent 3D model. In the case of static scenes, the sensor is typically moved around the scene and its pose is estimated over time. For dynamic scenes, there may be multiple rigid, articulated, or non-rigidly deforming surfaces to be tracked over time. The fusion component consists of integration of the aligned surface measurements, typically using an intermediate representation, such as the volumetric truncated signed distance field (TSDF). In this chapter, we discuss key recent approaches to 3D reconstruction from depth or RGB-D input, with an emphasis on real-time reconstruction of static scenes.
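As a concrete example of the fusion component, here is a standard, textbook-style TSDF integration step (not specific to any system discussed in the chapter): voxels are projected into a depth map and the truncated signed distance is folded into a running weighted average.

```python
# Classic TSDF integration sketch: fuse one depth map into a voxel volume.
import numpy as np

def integrate(tsdf, weight, voxel_pts, depth, K, cam_pose, trunc=0.05):
    """tsdf, weight: (V,) volume arrays; voxel_pts: (V, 3) world coordinates;
    depth: (H, W) depth map; K: 3x3 intrinsics; cam_pose: 4x4 camera-to-world."""
    # Transform voxels into the camera frame and project into the image.
    w2c = np.linalg.inv(cam_pose)
    pts_c = voxel_pts @ w2c[:3, :3].T + w2c[:3, 3]
    z = pts_c[:, 2]
    uv = (pts_c @ K.T)[:, :2] / z[:, None]
    u, v = np.round(uv[:, 0]).astype(int), np.round(uv[:, 1]).astype(int)
    H, W = depth.shape
    ok = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    d = depth[v[ok], u[ok]]
    # Signed distance along the viewing ray; skip voxels far behind the surface.
    sdf = d - z[ok]
    keep = (d > 0) & (sdf > -trunc)
    sdf = np.minimum(sdf[keep], trunc) / trunc   # normalized truncated SDF
    idx = np.where(ok)[0][keep]
    # Running weighted average fuses noisy, partial measurements over time.
    tsdf[idx] = (tsdf[idx] * weight[idx] + sdf) / (weight[idx] + 1)
    weight[idx] += 1
```

A mesh can then be extracted from the fused volume with marching cubes, which is the usual final step in TSDF pipelines.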