STaR: Self-supervised Tracking and Reconstruction of Rigid Objects in Motion with Neural Rendering
We present STaR, a novel method that performs Self-supervised Tracking and Reconstruction of dynamic scenes with rigid motion from multi-view RGB videos without any manual annotation. Recent work has shown that neural networks are surprisingly effective at the task of compressing many views of a scene into a learned function which maps from a viewing ray to an observed radiance value via volume rendering. Unfortunately, these methods lose all their predictive power once any object in the scene has moved. In this work, we explicitly model rigid motion of objects in the context of neural representations of radiance fields. We show that without any additional human-specified supervision, we can reconstruct a dynamic scene with a single rigid object in motion by simultaneously decomposing it into its two constituent parts and encoding each with its own neural representation. We achieve this by jointly optimizing the parameters of two neural radiance fields and a set of rigid poses which align the two fields at each frame. On both synthetic and real-world datasets, we demonstrate that our method can render photorealistic novel views, where novelty is measured on both spatial and temporal axes. Our factored representation furthermore enables animation of unseen object motion.
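To make the idea of two radiance fields aligned by a per-frame rigid pose more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that densities from the two fields are summed, that colors are blended by density, and that the result is volume rendered along the ray. The toy analytic fields and every name (`static_field`, `dynamic_field`, `world_to_object`, `render_ray`) are hypothetical stand-ins for the learned networks.

```python
import numpy as np

def static_field(pts):
    """Toy static scene (hypothetical): a dense blue wall filling z > 4."""
    sigma = np.where(pts[:, 2] > 4.0, 50.0, 0.0)
    rgb = np.tile([0.0, 0.0, 1.0], (len(pts), 1))
    return sigma, rgb

def dynamic_field(pts_obj):
    """Toy rigid object (hypothetical): a dense red sphere of radius 0.5
    centred at the origin of the object frame."""
    sigma = np.where(np.linalg.norm(pts_obj, axis=1) < 0.5, 50.0, 0.0)
    rgb = np.tile([1.0, 0.0, 0.0], (len(pts_obj), 1))
    return sigma, rgb

def world_to_object(pts_world, R, t):
    """Invert the object pose x_world = R @ x_obj + t for row-vector points,
    i.e. compute x_obj = R^T (x_world - t)."""
    return (pts_world - t) @ R

def render_ray(origin, direction, R, t, near=0.0, far=6.0, n_samples=128):
    """Volume render one ray through the sum of the two fields."""
    delta = (far - near) / n_samples
    depths = near + (np.arange(n_samples) + 0.5) * delta  # midpoint samples
    pts = origin + depths[:, None] * direction

    # The static field is queried in world coordinates; the object field is
    # queried in the object frame given by this frame's rigid pose (R, t).
    sigma_s, rgb_s = static_field(pts)
    sigma_d, rgb_d = dynamic_field(world_to_object(pts, R, t))

    # Assumed composition rule: add densities, blend colors by density.
    sigma = sigma_s + sigma_d
    rgb = (sigma_s[:, None] * rgb_s + sigma_d[:, None] * rgb_d) / (sigma[:, None] + 1e-10)

    # Standard emission-absorption quadrature.
    alpha = 1.0 - np.exp(-sigma * delta)
    transmittance = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = transmittance * alpha
    return (weights[:, None] * rgb).sum(axis=0)

origin = np.zeros(3)
direction = np.array([0.0, 0.0, 1.0])
# Pose that places the sphere on the ray, then one that moves it aside.
print(render_ray(origin, direction, np.eye(3), np.array([0.0, 0.0, 2.0])))
print(render_ray(origin, direction, np.eye(3), np.array([3.0, 0.0, 2.0])))
```

With the sphere on the ray the pixel should come out close to red, and with the sphere moved aside it should show the blue wall. In a learned setting the two toy functions would be networks, and the pose `(R, t)` for each frame would be optimized together with them.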
- Conference Article
74
- 10.1109/iccv.2015.109
- Dec 1, 2015
This paper introduces a general approach to dynamic scene reconstruction from multiple moving cameras without prior knowledge or limiting constraints on the scene structure, appearance, or illumination. Existing techniques for dynamic scene reconstruction from multiple wide-baseline camera views primarily focus on accurate reconstruction in controlled environments, where the cameras are fixed and calibrated and the background is known. These approaches are not robust for general dynamic scenes captured with sparse moving cameras. Previous approaches for outdoor dynamic scene reconstruction assume prior knowledge of the static background appearance and structure. The primary contributions of this paper are twofold: an automatic method for initial coarse dynamic scene segmentation and reconstruction without prior knowledge of background appearance or structure, and a general robust approach for joint segmentation refinement and dense reconstruction of dynamic scenes from multiple wide-baseline static or moving cameras. Evaluation is performed on a variety of indoor and outdoor scenes with cluttered backgrounds and multiple dynamic non-rigid objects such as people. Comparison with state-of-the-art approaches demonstrates improved accuracy in both multiple view segmentation and dense reconstruction. The proposed approach also eliminates the requirement for prior knowledge of scene structure and appearance.
- Research Article
15
- 10.1371/journal.pone.0055586
- May 7, 2013
- PLoS ONE
Remote dynamic three-dimensional (3D) scene reconstruction renders the motion structure of a 3D scene remotely by means of both the color video and the corresponding depth maps. It has shown great potential for telepresence applications such as remote monitoring and remote medical imaging. In this setting, video rate and high resolution are two crucial characteristics of a good depth map, which, however, conflict with each other during depth sensor capture. Therefore, recent works prefer to transmit only the high-resolution color video to the terminal side, where the scene depth is subsequently reconstructed by estimating motion vectors from the video, typically using propagation-based methods to achieve video-rate depth reconstruction. However, in most remote transmission systems, only the compressed color video stream is available. As a result, the color video restored from the stream suffers quality losses, and the extracted motion vectors are therefore inaccurate for depth reconstruction. In this paper, we propose a precise and robust scheme for dynamic 3D scene reconstruction that uses the compressed color video stream and its inaccurate motion vectors. Our method rectifies the inaccurate motion vectors by analyzing and compensating for their quality losses, motion vector absence in spatial prediction, and dislocation in near-boundary regions. This rectification ensures that the depth maps can be compensated at both video rate and high resolution at the terminal side, reducing system consumption in both compression and transmission. Our experiments validate that the proposed scheme is robust for depth map and dynamic scene reconstruction over long propagation distances, even at high compression ratios, outperforming the benchmark approaches with quality gains of at least 3.3950 dB for remote applications.
- Research Article
4
- 10.3390/app15084190
- Apr 10, 2025
- Applied Sciences
This paper presents a novel 3D Gaussian Splatting (3DGS)-based Simultaneous Localization and Mapping (SLAM) system that integrates Light Detection and Ranging (LiDAR) and vision data to enhance dynamic scene tracking and reconstruction. Existing 3DGS systems face challenges in sensor fusion and handling dynamic objects. To address these, we introduce a hybrid uncertainty-based 3D segmentation method that leverages uncertainty estimation and 3D object detection, effectively removing dynamic points and improving static map reconstruction. Our system also employs a sliding window-based keyframe fusion strategy that reduces computational load while maintaining accuracy. By incorporating a novel dynamic rendering loss function and pruning techniques, we suppress artifacts such as ghosting and ensure real-time operation in complex environments. Extensive experiments show that our system outperforms existing methods in dynamic object removal and overall reconstruction quality. The key innovations of our work lie in its integration of hybrid uncertainty-based segmentation, dynamic rendering loss functions, and an optimized sliding window strategy, which collectively enhance robustness and efficiency in dynamic scene reconstruction. This approach offers a promising solution for real-time robotic applications, including autonomous navigation and augmented reality.
- Conference Article
55
- 10.1109/cvpr.2017.592
- Jul 1, 2017
In this paper we propose a framework for spatially and temporally coherent semantic co-segmentation and reconstruction of complex dynamic scenes from multiple static or moving cameras. Semantic co-segmentation exploits the coherence in semantic class labels both spatially, between views at a single time instant, and temporally, between widely spaced time instants of dynamic objects with similar shape and appearance. We demonstrate that semantic coherence results in improved segmentation and reconstruction for complex scenes. A joint formulation is proposed for semantically coherent object-based co-segmentation and reconstruction of scenes by enforcing consistent semantic labelling between views and over time. Semantic tracklets are introduced to enforce temporal coherence in semantic labelling and reconstruction between widely spaced instances of dynamic objects. Tracklets of dynamic objects enable unsupervised learning of appearance and shape priors that are exploited in joint segmentation and reconstruction. Evaluation on challenging indoor and outdoor sequences with hand-held moving cameras shows improved accuracy in segmentation, temporally coherent semantic labelling and 3D reconstruction of dynamic scenes.
- Research Article
- 10.1609/aaai.v39i9.33045
- Apr 11, 2025
- Proceedings of the AAAI Conference on Artificial Intelligence
While Neural Radiance Fields (NeRFs) have advanced the frontiers of novel view synthesis (NVS) using LiDAR data, they still struggle in dynamic scenes. Due to the low frequency and sparsity characteristics of LiDAR point clouds, it is challenging to spontaneously learn a dynamic and consistent scene representation from posed scans. In this paper, we propose STGC-NeRF, a novel LiDAR NeRF method that combines spatial-temporal geometry consistency to enhance the reconstruction of dynamic scenes. First, we propose a temporal geometry consistency regularization to enhance the regression of time-varying scene geometries from low-frequency LiDAR sequences. By estimating the pointwise correspondences between synthetic (or real) and real frames at different times, we convert them into various forms of temporal supervision. This alleviates the inconsistency caused by moving objects in dynamic scenes. Second, to improve the reconstruction of sparse LiDAR data, we propose spatial geometric consistency constraints. By computing multiple neighborhood feature descriptors incorporating geometric and contextual information, we capture structural geometry information from sparse LiDAR data. This helps encourage consistent direction, smoothness, and detail of the local surface. Extensive experiments on the KITTI-360 and nuScenes datasets demonstrate that STGC-NeRF outperforms state-of-the-art methods in both geometry and intensity accuracy for dynamic LiDAR scene reconstruction.
- Research Article
46
- 10.2514/3.5881
- Jul 1, 1970
- AIAA Journal
Displacement functions used to construct the stiffness matrix of a finite element should possess the following properties. 1) Infinitesimal rigid body motions should be accurately represented. If this requirement is not met, the conditions of equilibrium of the element are not satisfied. 2) The displacement functions should contain all the lower terms of a complete set of functions. This requirement ensures monotonic convergence by mesh size reduction. 3) A minimum degree of interelement continuity must be maintained between adjacent elements. This minimum degree of compatibility must ensure a perfect match for the in-plane and the out-of-plane components of displacement. Also, for the out-of-plane component, slopes tangent and normal to all common edges of two adjacent elements must match. This requirement then ensures convergence to an exact result by mesh size reduction. The importance of the last two requirements is firmly established; however, the first requirement has been shown to be problem dependent. If the structure to be analyzed is so constrained that no element of the structure is ever going to undergo any rigid body motion, then obviously this requirement can be violated. For example, axisymmetric elements acted upon by axisymmetric loads need to have only one rigid body mode: a rigid translation parallel to the axis of symmetry. For this particular type of element, a truncated cone as used by Grafton and Strome always includes a rigid body motion parallel to the longitudinal axis. However, if the axisymmetric element is to have curvature in the longitudinal direction, then all the rigid body modes are absent. Jones and Strome recognized such a deficiency and reintroduced a longitudinal translation in their element. Later, Stricklin et al. reported on a similar improved element but omitted the longitudinal rigid body motion altogether.
This last element is capable of handling asymmetric loading; it is therefore not difficult to imagine a loading in which many elements would have to undergo considerable transverse motion. A cantilevered structure would lead to such a situation. Haisler and Stricklin studied the influence of longitudinal translation and observed that such a rigid motion is recovered by mesh size reduction. For elements of rectangular aspect, Bogner, Fox, and Schmit developed a systematic method for constructing acceptable displacement fields. However, for curved cylindrical elements, only two rigid body modes are accounted for. The same authors reported on a (48 × 48) stiffness matrix and mentioned that an eigenvalue analysis of such a matrix indicated that rigid body motions were adequately represented. However, as pointed out in our study of curved cylindrical elements, rigid body motions cannot be represented by independent displacement components. In the same reference, the importance of these rigid body motions is clearly illustrated in several examples. However, the inclusion of rigid body motions was done at the expense of rigorous interelement compatibility. This compromise resulted in a significant improvement in the behavior of the element. In this paper we develop a method to include rigid body motions without compromising deformational compatibility. The method is general and can be applied without difficulty to any element, curved or flat. The improvements of a curved cylindrical element are illustrated with one example.
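The eigenvalue check mentioned above can be illustrated on a much simpler element than the curved shell elements discussed in the abstract. The sketch below is an assumption-laden stand-in: it uses a two-node axial bar element, whose stiffness matrix has exactly one rigid body mode (a uniform translation), and counts near-zero eigenvalues. The function names and the material values are hypothetical.

```python
import numpy as np

def bar_stiffness(E, A, L):
    """Stiffness matrix of a 2-node axial bar element (illustrative only)."""
    return (E * A / L) * np.array([[1.0, -1.0],
                                   [-1.0, 1.0]])

def count_rigid_body_modes(K, rel_tol=1e-9):
    """Count eigenvalues of a symmetric K that are zero relative to the largest one.
    Each such eigenvalue corresponds to a motion that produces no strain energy."""
    eigvals = np.linalg.eigvalsh(K)
    return int(np.sum(np.abs(eigvals) < rel_tol * np.abs(eigvals).max()))

K = bar_stiffness(E=210e9, A=1e-4, L=2.0)

# A rigid translation moves both nodes by the same amount and should
# generate no nodal forces: K @ r = 0.
rigid_translation = np.ones(2)
print(K @ rigid_translation)
print(count_rigid_body_modes(K))
```

For an element that fails to represent a rigid body motion, the corresponding product `K @ r` would be nonzero, which is the equilibrium violation that requirement 1) is meant to rule out.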
- Research Article
39
- 10.1016/j.cosrev.2020.100338
- Dec 5, 2020
- Computer Science Review
Real-time 3D reconstruction techniques applied in dynamic scenes: A systematic literature review
- Research Article
4
- 10.1145/3658229
- Jul 19, 2024
- ACM Transactions on Graphics
The generation of global illumination in real time has been a long-standing challenge in the graphics community, particularly in dynamic scenes with complex illumination. Recent neural rendering techniques have shown great promise by utilizing neural networks to represent the illumination of scenes and then decoding the final radiance. However, incorporating object parameters into the representation may limit their effectiveness in handling fully dynamic scenes. This work presents a neural rendering approach, dubbed LightFormer, that can generate realistic global illumination for fully dynamic scenes, including dynamic lighting, materials, cameras, and animated objects, in real time. Inspired by classic many-lights methods, the proposed approach focuses on the neural representation of light sources in the scene rather than the entire scene, leading to better overall generalizability. The neural prediction is achieved by leveraging the virtual point lights and shading clues for each light. Specifically, two stages are explored. In the light encoding stage, each light generates a set of virtual point lights in the scene, which are then encoded into an implicit neural light representation, along with screen-space shading clues like visibility. In the light gathering stage, a pixel-light attention mechanism composites all light representations for each shading point. Given the geometry and material representation, in tandem with the composed light representations of all lights, a lightweight neural network predicts the final radiance. Experimental results demonstrate that the proposed LightFormer can yield reasonable and realistic global illumination in fully dynamic scenes with real-time performance.
- Research Article
11
- 10.1002/cav.209
- Jul 31, 2007
- Computer Animation and Virtual Worlds
Computer graphics animation often lacks interaction between rigid objects and granular material. In this paper, we propose a method for deforming a ground surface that consists of granular material when it is penetrated by a rigid body in motion. The motion of the rigid object is, in turn, affected by its collision with the ground surface. Our simulation model covers updating the motion of the object, collision detection between the rigid object and the ground surface, the distribution of the granular ground material, and the deformation of the ground surface. Our contribution is a method to simulate the interaction between the granular ground material and the rigid body in motion. Moreover, a render-to-texture method is presented to accelerate ray-casting collision detection between the ground surface and the object. Our implementation of the method runs at interactive frame rates. Copyright © 2007 John Wiley & Sons, Ltd.
- Conference Article
5
- 10.1109/cmvit.2017.8
- Feb 1, 2017
Detection and tracking of moving objects in video is an emerging topic in computer vision and robotics. Classifying objects in addition to detecting and tracking them can lead to a better understanding of the scene and helps in making decisions in various applications. Classifying objects based on features such as the color, shape, size, speed, and direction of objects in motion has numerous applications. In this paper, we consider the size of an object in motion. We present a model to estimate the size of an object in motion using an optical flow technique, and present some applications in watermarking, steganography, and robotics, where the size of an object in motion is an important parameter to be considered.
- Conference Article
5
- 10.1109/icecit.2017.8456442
- Dec 1, 2017
Computer vision and robotics are promising areas in which several applications have been explored during the last decade. Detection and tracking of moving objects, together with classification of objects, help in making decisions in various applications. Classification of objects based on features such as the color, shape, size, speed, and direction of objects in motion has numerous applications. In this paper, the Sagar G. et al. scheme [2] is extended to find the size of an object in motion, which has applications in watermarking, steganography, and robotics. The proposed scheme uses the K-means clustering technique to segment the objects in motion during the process of estimating the size of an object in motion.
- Conference Article
1
- 10.1109/icme.2006.262421
- Jul 1, 2006
In this article, we discuss 3D shape reconstruction of an object in rigid motion with the volume intersection method. When the object moves rigidly, the cameras change their positions relative to the object at every moment. To estimate the motion correctly, we propose new feature points, called outcrop points, on the reconstructed 3D shape. These points are guaranteed to be located on the real surface of the object. If the rigid motion of the object can be correctly estimated, cameras at different moments virtually serve as cameras in different positions. With these cameras in time sequences, we can increase the accuracy of the reconstructed 3D shape without increasing the number of cameras. Based on this idea, we reconstruct an accurate shape of the object in motion from images obtained by a limited number of cameras. As a result, we can acquire an accurate shape from images in time sequences.
- Research Article
- 10.1109/tpami.2025.3629570
- Mar 1, 2026
- IEEE Transactions on Pattern Analysis and Machine Intelligence
3D scene flow represents the dense per-point motion field in dynamic scenes, playing a crucial role in various downstream tasks, including motion segmentation, dynamic scene reconstruction, 4D content generation, etc. However, previous regression-based works commonly suffer from unreliable correlations caused by locally constrained search ranges and struggle with the absence of timely feedback regarding the flow estimation uncertainty during training. To address these challenges, we propose a novel uncertainty-aware network for scene flow estimation, termed DifFlow3D, based on the conditional probabilistic diffusion model. Hierarchical diffusion-based flow estimation blocks are designed to enhance the correlation robustness and resilience to challenging cases, e.g., dynamics, noisy inputs, repetitive patterns, etc. To mitigate the generation diversity, three key flow-related features are leveraged as conditions in our diffusion model. Furthermore, we develop an uncertainty estimation module within diffusion to assess the reliability of estimated scene flow dynamically. A Hidden State Denoising strategy (HSD) is also introduced to further boost the stability of the reverse denoising process. Extensive experiments conducted on four scene flow datasets, including both synthetic and real-world datasets (FlyingThings3D, KITTI 2015, Argoverse, and Waymo Open), demonstrate the superiority of our proposed DifFlow3D. Compared to prior state-of-the-art methods, DifFlow3D has 26.0%, 36.4%, 35.3%, and 17.7% EPE3D reduction respectively across four datasets. Only trained on the synthetic FlyingThings3D dataset, our method achieves an unprecedented millimeter-level accuracy (0.0070 m EPE3D) on the real-scene KITTI dataset, highlighting its exceptional generalization capability. Additionally, our diffusion-based refinement paradigm can be seamlessly integrated as a plug-and-play module into existing scene flow networks, significantly enhancing their estimation accuracy. 
We also introduce our pre-trained scene flow estimator as explicit motion priors into the novel dynamic LiDAR view synthesis task, which validates its great potential for improving the 4D LiDAR reconstruction performance.
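The EPE3D figures quoted above refer to the 3D end-point error, which is commonly computed as the mean Euclidean distance between predicted and ground-truth flow vectors. The following is a minimal sketch of that common definition, not code from the paper; the function name `epe3d` and the sample values are hypothetical.

```python
import numpy as np

def epe3d(pred_flow, gt_flow):
    """Mean Euclidean distance between predicted and ground-truth 3D flow
    vectors, given as arrays of shape (N, 3) in the same units (e.g. metres)."""
    pred_flow = np.asarray(pred_flow, dtype=float)
    gt_flow = np.asarray(gt_flow, dtype=float)
    return float(np.linalg.norm(pred_flow - gt_flow, axis=1).mean())

# Two points: the first prediction is off by a 3-4-0 vector, the second is exact.
pred = np.zeros((2, 3))
gt = np.array([[3.0, 4.0, 0.0],
               [0.0, 0.0, 0.0]])
print(epe3d(pred, gt))  # per-point errors are 5 and 0, so the mean is 2.5
```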
- Research Article
4
- 10.1016/j.imavis.2024.104913
- Jan 20, 2024
- Image and Vision Computing
Three dimensional tracking of rigid objects in motion using 2D optical flows
- Research Article
4
- 10.1007/s10043-015-0131-4
- Aug 12, 2015
- Optical Review
Moving object tracking is one of the most important research directions in computer vision. Challenges in designing a tracking method are usually caused by occlusions, noise, or illumination changes. In this paper, a robust visual tracking algorithm is proposed to cope with occlusion by formulating moving object tracking as a low-rank matrix representation problem. First, as the main contribution of this paper, the observation matrix composed of image sequences is decomposed into a low-rank matrix and a sparse matrix. The moving object in the image sequence forms the low-rank matrix, and the occlusion on the moving object forms the sparse matrix. Then, moving object tracking is carried out using a Bayesian state under the particle filter framework. Finally, an effective alternating algorithm is utilized to solve the proposed optimization formulation. The proposed algorithm has been evaluated on several challenging image sequences, and experimental results show that it works effectively and efficiently in different situations.
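The split of an observation matrix into a low-rank part and a sparse part can be sketched with a generic technique. The code below is not the paper's algorithm: it is a standard principal component pursuit solved by alternating updates (an augmented Lagrangian scheme), with commonly used default parameters that are assumptions here. All function names are hypothetical.

```python
import numpy as np

def soft_threshold(X, tau):
    """Shrink every entry of X toward zero by tau."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def singular_value_threshold(X, tau):
    """Shrink the singular values of X by tau (proximal step for the nuclear norm)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * soft_threshold(s, tau)) @ Vt

def robust_pca(M, n_iter=1000, tol=1e-7):
    """Split M into low-rank L plus sparse S by alternating minimisation.
    lam and mu follow widely used defaults; they are assumptions, not tuned values."""
    m, n = M.shape
    lam = 1.0 / np.sqrt(max(m, n))
    mu = m * n / (4.0 * np.abs(M).sum())
    L = np.zeros_like(M)
    S = np.zeros_like(M)
    Y = np.zeros_like(M)  # Lagrange multiplier for the constraint M = L + S
    for _ in range(n_iter):
        L = singular_value_threshold(M - S + Y / mu, 1.0 / mu)
        S = soft_threshold(M - L + Y / mu, lam / mu)
        residual = M - L - S
        Y = Y + mu * residual
        if np.linalg.norm(residual) <= tol * np.linalg.norm(M):
            break
    return L, S

# Synthetic check: a rank-2 matrix corrupted by sparse, large-magnitude entries.
rng = np.random.default_rng(0)
L_true = rng.standard_normal((60, 2)) @ rng.standard_normal((2, 60))
S_true = np.zeros((60, 60))
mask = rng.random((60, 60)) < 0.05
S_true[mask] = rng.uniform(-10.0, 10.0, size=mask.sum())
L_est, S_est = robust_pca(L_true + S_true)
print(np.linalg.norm(L_est - L_true) / np.linalg.norm(L_true))
```

In a tracking setting, each column of `M` would typically hold one vectorised image patch, so that consistent appearance across frames ends up in `L` and localized disturbances such as occlusion end up in `S`.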