CLA-NeRF: Category-Level Articulated Neural Radiance Field

Abstract

We propose CLA-NeRF - a Category-Level Articulated Neural Radiance Field that can perform view synthesis, part segmentation, and articulated pose estimation. CLA-NeRF is trained at the object category level using no CAD models and no depth, but a set of RGB images with ground truth camera poses and part segments. During inference, it only takes a few RGB views (i.e., few-shot) of an unseen 3D object instance within the known category to infer the object part segmentation and the neural radiance field. Given an articulated pose as input, CLA-NeRF can perform articulation-aware volume rendering to generate the corresponding RGB image at any camera pose. Moreover, the articulated pose of an object can be estimated via inverse rendering. In our experiments, we evaluate the framework across five categories on both synthetic and real-world data. In all cases, our method shows realistic deformation results and accurate articulated pose estimation. We believe that both few-shot articulated object rendering and articulated pose estimation open doors for robots to perceive and interact with unseen articulated objects.
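The two mechanisms named in the abstract, articulation-aware volume rendering and pose estimation via inverse rendering, can be illustrated with a toy 2D sketch. This is not the authors' implementation. The hand-written field, the hinge location, the orthographic camera, and every function name below are hypothetical. The per-part selection rule (map each sample back to the rest pose and keep it only if the field assigns it to that part) is one plausible reading of "articulation-aware". A grid search over candidate angles stands in for whatever optimizer the paper uses.

```python
import math

HINGE = (0.0, 0.0)  # hypothetical hinge of the movable part


def canonical_field(x, y):
    """Toy rest-pose field returning (density, color, part_id)."""
    # Part 0: static base block to the left of the hinge.
    if -1.0 <= x <= -0.2 and -0.1 <= y <= 0.1:
        return 5.0, 0.2, 0
    # Part 1: movable bar extending right from the hinge at rest.
    if 0.0 <= x <= 1.0 and -0.05 <= y <= 0.05:
        return 5.0, 0.9, 1
    return 0.0, 0.0, -1


def to_canonical(x, y, part, theta):
    """Undo a part's articulation; only part 1 rotates about HINGE."""
    if part == 0:
        return x, y
    dx, dy = x - HINGE[0], y - HINGE[1]
    c, s = math.cos(-theta), math.sin(-theta)
    return HINGE[0] + c * dx - s * dy, HINGE[1] + s * dx + c * dy


def articulated_query(x, y, theta):
    """Query the field under joint angle theta (assumed selection rule)."""
    for part in (0, 1):
        cx, cy = to_canonical(x, y, part, theta)
        sigma, color, part_id = canonical_field(cx, cy)
        if part_id == part:  # sample belongs to the part we un-posed
            return sigma, color
    return 0.0, 0.0


def render_ray(origin, direction, theta, near=0.0, far=4.0, n=200):
    """Standard volume-rendering quadrature along one ray."""
    delta = (far - near) / n
    transmittance, pixel = 1.0, 0.0
    for i in range(n):
        t = near + (i + 0.5) * delta
        x = origin[0] + t * direction[0]
        y = origin[1] + t * direction[1]
        sigma, color = articulated_query(x, y, theta)
        alpha = 1.0 - math.exp(-sigma * delta)
        pixel += transmittance * alpha * color
        transmittance *= 1.0 - alpha
    return pixel


def render_image(theta, n_pixels=64):
    """1D image from a toy orthographic camera at y = 2 looking down."""
    return [
        render_ray((-1.5 + 3.0 * (i + 0.5) / n_pixels, 2.0), (0.0, -1.0), theta)
        for i in range(n_pixels)
    ]


def estimate_angle(observed, candidates):
    """Inverse rendering by grid search: pick the best-matching angle."""
    def loss(theta):
        rendered = render_image(theta, n_pixels=len(observed))
        return sum((r - o) ** 2 for r, o in zip(rendered, observed))
    return min(candidates, key=loss)
```

For example, rendering an observation with `render_image(0.6)` and calling `estimate_angle` on it with candidates `[0.0, 0.3, 0.6, 0.9, 1.2]` recovers the angle used to render it. A learned field would replace `canonical_field`, and a gradient-based optimizer would normally replace the grid search.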

Similar Papers
  • Conference Article
  • Cite Count Icon 65
  • 10.1109/icra46639.2022.9811667
AirDOS: Dynamic SLAM benefits from Articulated Objects
  • May 23, 2022
  • Yuheng Qiu + 4 more

Dynamic Object-aware SLAM (DOS) exploits object-level information to enable robust motion estimation in dynamic environments. Existing methods mainly focus on identifying and excluding dynamic objects from the optimization. In this paper, we show that feature-based visual SLAM systems can also benefit from the presence of dynamic articulated objects by taking advantage of two observations: (1) The 3D structure of each rigid part of an articulated object remains consistent over time; (2) The points on the same rigid part follow the same motion. In particular, we present AirDOS, a dynamic object-aware system that introduces rigidity and motion constraints to model articulated objects. By jointly optimizing the camera pose, object motion, and the object 3D structure, we can rectify the camera pose estimation, preventing tracking loss, and generate 4D spatio-temporal maps for both dynamic objects and static scenes. Experiments show that our algorithm improves the robustness of visual SLAM algorithms in challenging crowded urban environments. To the best of our knowledge, AirDOS is the first dynamic object-aware SLAM system demonstrating that camera pose estimation can be improved by incorporating dynamic articulated objects.

  • Conference Article
  • Cite Count Icon 4
  • 10.1109/etfa46521.2020.9211967
Kinematic Structures Estimation on the RGB-D Images
  • Sep 1, 2020
  • Rafal Staszak + 4 more

In this paper, we propose a system which detects and estimates the kinematic structures of objects in the indoor environment. We are interested in specific types of objects like doors, sliding doors, and drawers which are common in the human environment and very important taking into account the full autonomy of mobile robots. We assume that the mobile robot is equipped with an RGB-D camera. We utilize a Convolutional Neural Network-based (CNN-based) object detector to locate the articulated objects on the input image created from a pair of RGB-D images. Taking into account strong prior knowledge about the articulated object, we detect the segments on the image which belong to the articulated object. Then, the optimization-based procedure finds the 3D pose and configuration of the joint detected on the scene. We train and verify the method on the images from the Kinect sensor. The performance of the proposed method shows that we can estimate articulated objects in the indoor environment using typical sensors available on the mobile robot.

  • Conference Article
  • Cite Count Icon 10
  • 10.1109/iccvw.2015.111
Reconstruction of Articulated Objects from a Moving Camera
  • Dec 1, 2015
  • Kaan Yucer + 3 more

Many scenes that we would like to reconstruct contain articulated objects, and are often captured by only a single, non-fixed camera. Existing techniques for reconstructing articulated objects either require templates, which can be challenging to acquire, or have difficulties with perspective effects and missing data. In this paper, we present a novel reconstruction pipeline that first treats each feature point tracked on the object independently and incrementally imposes constraints. We make use of the idea that the unknown 3D trajectory of a point tracked in 2D should lie on a manifold that is described by the camera rays going through the tracked 2D positions. We compute an initial reconstruction by solving for latent 3D trajectories that maximize temporal smoothness on these manifolds. We then leverage these 3D estimates to automatically segment an object into piecewise rigid parts, and compute a refined shape and motion using sparse bundle adjustment. Finally, we apply kinematic constraints on automatically computed joint positions to enforce connectivity between different rigid parts, which further reduces ambiguous motion and increases reconstruction accuracy. Each step of our pipeline enforces temporal smoothness, and together results in a high quality articulated object reconstruction. We show the usefulness of our approach in both synthetic and real datasets and compare against other non-rigid reconstruction techniques.

  • Conference Article
  • Cite Count Icon 35
  • 10.1109/iccv48922.2021.01546
Act the Part: Learning Interaction Strategies for Articulated Object Part Discovery
  • Oct 1, 2021
  • Samir Yitzhak Gadre + 2 more

People often use physical intuition when manipulating articulated objects, irrespective of object semantics. Motivated by this observation, we identify an important embodied task where an agent must play with objects to recover their parts. To this end, we introduce Act the Part (AtP) to learn how to interact with articulated objects to discover and segment their pieces. By coupling action selection and motion segmentation, AtP is able to isolate structures to make perceptual part recovery possible without semantic labels. Our experiments show AtP learns efficient strategies for part discovery, can generalize to unseen categories, and is capable of conditional reasoning for the task. Although trained in simulation, we show convincing transfer to real world data with no fine-tuning.

  • Research Article
  • Cite Count Icon 2
  • 10.1016/j.eswa.2005.09.005
Homomorphic graph matching of articulated objects by an integrated recognition scheme
  • Sep 30, 2005
  • Expert Systems with Applications
  • Chin-Chung Huang + 1 more


  • Research Article
  • Cite Count Icon 8
  • 10.1109/lra.2023.3313063
Part-Guided 3D RL for Sim2Real Articulated Object Manipulation
  • Nov 1, 2023
  • IEEE Robotics and Automation Letters
  • Pengwei Xie + 8 more

Manipulating unseen articulated objects through visual feedback is a critical but challenging task for real robots. Existing learning-based solutions mainly focus on visual affordance learning or other pre-trained visual models to guide manipulation policies, which face challenges for novel instances in real-world scenarios. In this letter, we propose a novel part-guided 3D RL framework, which can learn to manipulate articulated objects without demonstrations. We combine the strengths of 2D segmentation and 3D RL to improve the efficiency of RL policy training. To improve the stability of the policy on real robots, we design a Frame-consistent Uncertainty-aware Sampling (FUS) strategy to get a condensed and hierarchical 3D representation. In addition, a single versatile RL policy can be trained on multiple articulated object manipulation tasks simultaneously in simulation and shows great generalizability to novel categories and instances. Experimental results demonstrate the effectiveness of our framework in both simulation and real-world settings.

  • Conference Article
  • Cite Count Icon 395
  • 10.1109/cvpr.2003.1211340
Shape-from-silhouette of articulated objects and its use for human body kinematics estimation and motion capture
  • Jun 18, 2003
  • K.M.G Cheung + 2 more

Shape-from-silhouette (SFS), also known as visual hull (VH) construction, is a popular 3D reconstruction method, which estimates the shape of an object from multiple silhouette images. The original SFS formulation assumes that the entire silhouette images are captured either at the same time or while the object is static. This assumption is violated when the object moves or changes shape. Hence the use of SFS with moving objects has been restricted to treating each time instant sequentially and independently. Recently we have successfully extended the traditional SFS formulation to refine the shape of a rigidly moving object over time. We further extend SFS to apply to dynamic articulated objects. Given silhouettes of a moving articulated object, the process of recovering the shape and motion requires two steps: (1) correctly segmenting (points on the boundary of) the silhouettes to each articulated part of the object, (2) estimating the motion of each individual part using the segmented silhouette. In this paper, we propose an iterative algorithm to solve this simultaneous assignment and alignment problem. Once we have estimated the shape and motion of each part of the object, the articulation points between each pair of rigid parts are obtained by solving a simple motion constraint between the connected parts. To validate our algorithm, we first apply it to segment the different body parts and estimate the joint positions of a person. The acquired kinematic (shape and joint) information is then used to track the motion of the person in new video sequences.

  • Conference Article
  • Cite Count Icon 71
  • 10.1109/iros47612.2022.9981779
Articulated Object Interaction in Unknown Scenes with Whole-Body Mobile Manipulation
  • Oct 23, 2022
  • Mayank Mittal + 4 more

A kitchen assistant needs to operate human-scale objects, such as cabinets and ovens, in unmapped environments with dynamic obstacles. Autonomous interactions in such environments require integrating dexterous manipulation and fluid mobility. While mobile manipulators in different form factors provide an extended workspace, their real-world adoption has been limited. Executing a high-level task for general objects requires a perceptual understanding of the object as well as adaptive whole-body control among dynamic obstacles. In this paper, we propose a two-stage architecture for autonomous interaction with large articulated objects in unknown environments. The first stage, object-centric planner, only focuses on the object to provide an action-conditional sequence of states for manipulation using RGB-D data. The second stage, agent-centric planner, formulates the whole-body motion control as an optimal control problem that ensures safe tracking of the generated plan, even in scenes with moving obstacles. We show that the proposed pipeline can handle complex static and dynamic kitchen settings for both wheel-based and legged mobile manipulators. Compared to other agent-centric planners, our proposed planner achieves a higher success rate and a lower execution time. We also perform hardware tests on a legged mobile manipulator to interact with various articulated objects in a kitchen. For additional material, please check: www.pair.toronto.edu/articulated-mm/.

  • Research Article
  • Cite Count Icon 37
  • 10.1109/tip.2021.3138644
Toward Real-World Category-Level Articulation Pose Estimation.
  • Jan 1, 2022
  • IEEE Transactions on Image Processing
  • Liu Liu + 4 more

Human life is populated with articulated objects. Current Category-level Articulation Pose Estimation (CAPE) methods are studied under the single-instance setting with a fixed kinematic structure for each category. Considering these limitations, we aim to study the problem of estimating part-level 6D pose for multiple articulated objects with unknown kinematic structures in a single RGB-D image, and reform this problem setting for real-world environments and suggest a CAPE-Real (CAPER) task setting. This setting allows varied kinematic structures within a semantic category, and multiple instances to co-exist in an observation of the real world. To support this task, we build an articulated model repository ReArt-48 and present an efficient dataset generation pipeline, which contains Fast Articulated Object Modeling (FAOM) and Semi-Authentic MixEd Reality Technique (SAMERT). Accompanying the pipeline, we build a large-scale mixed reality dataset ReArtMix and a real world dataset ReArtVal. Accompanying the CAPER problem and the dataset, we propose an effective framework that exploits RGB-D input to estimate part-level pose for multiple instances in a single forward pass. In our method, we introduce object detection from RGB-D input to handle the multi-instance problem and segment each instance into several parts. To address the unknown kinematic structure issue, we propose an Articulation Parsing Network to analyze the structure of the detected instance, and also build a Pair Articulation Pose Estimation module to estimate per-part 6D pose as well as joint property from connected part pairs. Extensive experiments demonstrate that the proposed method can achieve good performance on CAPER, CAPE and instance-level Robot Arm pose estimation problems. We believe it could serve as a strong baseline for future research on the CAPER task. The datasets and codes in our work will be made publicly available.

  • Conference Article
  • 10.1109/ismar.2011.6092397
Graph-cut-based 3D model segmentation for articulated object reconstruction
  • Oct 1, 2011
  • Inkyu Han + 2 more

The three-dimensional (3D) reconstruction of objects has been well studied in the literature of augmented reality (AR) [1, 2]. Most existing studies have assumed that the to-be-constructed target object is rigid, whereas objects in the real world can be dynamic or deformable. Therefore, AR systems are required to deal with non-rigid objects to be adaptive to environmental changes. In this paper, we address the problem of reconstructing articulated objects as a starting point for modeling deformable objects. An articulated object is composed of partially rigid components linked with joints. After building a mesh model of the object, the model is segmented into the components along their boundaries by a graph-cut-based approach that we propose.

  • Conference Article
  • Cite Count Icon 64
  • 10.1109/cvpr.2006.66
Automatic Kinematic Chain Building from Feature Trajectories of Articulated Objects
  • Jun 17, 2006
  • Jingyu Yan + 1 more

We investigate the problem of learning the structure of an articulated object, i.e. its kinematic chain, from feature trajectories under affine projections. We demonstrate this possibility by proposing an algorithm which first segments the trajectories by local sampling and spectral clustering, then builds the kinematic chain as a minimum spanning tree of a graph constructed from the segmented motion subspaces. We test our method in challenging data sets and demonstrate the ability to automatically build the kinematic chain of an articulated object from feature trajectories. The algorithm also works when there are multiple articulated objects in the scene. Furthermore, we take into account non-rigid articulated parts that exist in human motions. We believe this advance will have impact on articulated object tracking and dynamical structure from motion.
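The final step described above, building the kinematic chain as a minimum spanning tree over the segmented parts, can be sketched with Kruskal's algorithm. This is only an illustration. The part names and edge weights are made-up placeholders, and the abstract does not say how the graph weights are derived from the motion subspaces, so the input here is simply assumed to be a list of pairwise costs in which a lower cost means two parts are more likely joined.

```python
def kinematic_chain(parts, weighted_edges):
    """Return MST edges (Kruskal) over parts; lower weight = likelier joint.

    weighted_edges: list of (weight, part_a, part_b); values are hypothetical.
    """
    parent = {p: p for p in parts}

    def find(p):
        # Union-find root lookup with path halving.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    chain = []
    for weight, a, b in sorted(weighted_edges):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # edge connects two separate components
            parent[root_a] = root_b
            chain.append((a, b))
    return chain


# Placeholder costs between four segmented parts.
parts = ["torso", "upper_arm", "forearm", "head"]
edges = [
    (1.0, "torso", "upper_arm"),
    (1.5, "upper_arm", "forearm"),
    (4.0, "torso", "forearm"),
    (2.0, "torso", "head"),
    (6.0, "head", "forearm"),
    (5.0, "head", "upper_arm"),
]
chain = kinematic_chain(parts, edges)
```

With these placeholder costs the tree links torso to upper arm, upper arm to forearm, and torso to head, which is the kind of chain structure the abstract refers to.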

  • Conference Article
  • Cite Count Icon 4
  • 10.1109/cvpr.2012.6247669
Scale resilient, rotation invariant articulated object matching
  • Jun 1, 2012
  • Hao Jiang + 3 more

A novel method is proposed for matching articulated objects in cluttered videos. The method needs only a single exemplar image of the target object. Instead of using a small set of large parts to represent an articulated object, the proposed model uses hundreds of small units to represent walks along paths of pixels between key points on an articulated object. Matching directly on dense pixels is key to achieving reliable matching when motion blur occurs. The proposed method fits the model to local image properties, conforms to structure constraints, and remembers the steps taken along a pixel path. The model formulation handles variations in object scaling, rotation and articulation. Recovery of the optimal pixel walks is posed as a special shortest path problem, which can be solved efficiently via dynamic programming. Further speedup is achieved via factorization of the path costs. An efficient method is proposed to find multiple walks and simultaneously match multiple key points. Experiments show that the proposed method is efficient and reliable and can be used to match articulated objects in fast motion videos with strong clutter and blurry imagery.
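The abstract poses walk recovery as a shortest path problem solved by dynamic programming but does not give the formulation. As a generic illustration only, the sketch below runs a Viterbi-style dynamic program over a layered graph: each step of a walk has a few candidate positions with a local matching cost, and moving between consecutive candidates incurs a step cost. The 1D positions, the costs, and the function name are hypothetical and are not taken from the paper.

```python
def cheapest_walk(layers, step_cost):
    """Dynamic programming over a layered graph.

    layers: list of steps, each a list of (position, local_cost) candidates.
    step_cost: function(prev_position, position) -> transition cost.
    Returns (total_cost, positions of the cheapest walk).
    """
    costs = [local for _, local in layers[0]]
    backptrs = []
    for k in range(1, len(layers)):
        new_costs, ptrs = [], []
        for pos, local in layers[k]:
            # Cost of reaching this candidate from each previous candidate.
            options = [
                costs[i] + step_cost(prev_pos, pos)
                for i, (prev_pos, _) in enumerate(layers[k - 1])
            ]
            i_best = min(range(len(options)), key=options.__getitem__)
            new_costs.append(options[i_best] + local)
            ptrs.append(i_best)
        costs = new_costs
        backptrs.append(ptrs)

    # Backtrack from the cheapest final candidate.
    j = min(range(len(costs)), key=costs.__getitem__)
    total = costs[j]
    path = [layers[-1][j][0]]
    for k in range(len(layers) - 1, 0, -1):
        j = backptrs[k - 1][j]
        path.append(layers[k - 1][j][0])
    path.reverse()
    return total, path


# Toy 1D example: three steps, two candidate positions each.
layers = [[(0, 1.0), (5, 0.0)], [(1, 0.0), (6, 2.0)], [(2, 0.5), (9, 0.0)]]
total, path = cheapest_walk(layers, lambda a, b: abs(a - b))
```

The run time is linear in the number of steps and quadratic in the number of candidates per step, which is what makes this kind of formulation efficient compared with enumerating all walks.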

  • Dissertation
  • 10.32657/10356/200242
Radiance fields for 3D scene representation and rendering
  • Jan 1, 2024
  • Jiahui Zhang

3D scene representation and rendering have been pivotal tasks in 3D computer vision and computer graphics, essential for various applications such as virtual reality, augmented reality, and autonomous driving. As leading radiance field methods, neural radiance fields (NeRF) and 3D Gaussian splatting (3DGS) have recently achieved high-quality 3D scene representations by using MLPs and 3D primitives, respectively. In addition, they also achieve state-of-the-art scene rendering for novel view synthesis based on volume rendering and rasterization, respectively. Despite the significant progress in 3D scene representation and rendering, NeRF and 3DGS still face many challenges. First, proper NeRF training and high-quality scene representation and rendering depend on either reasonable camera pose initialization or manually-crafted camera pose distributions which are often unavailable, or hard to acquire in various real-world data. While Structure-from-Motion is frequently adopted to pre-compute camera poses, it is time-consuming and lacks differentiability which impedes the research and development of NeRF-based methods. The second is the domain gap issue in pose-free NeRF. One typical pipeline of pose-free NeRFs first trains a pose estimator with rendered images and then performs joint optimization of NeRF model and camera poses of real images predicted by the pose estimator. However, it relies solely on rendered images to train camera pose estimator, which often leads to biased and inaccurate camera pose estimation due to the domain gap between rendered and real images. This discrepancy can further result in local minima in the joint optimization of camera pose and NeRF scene representations. Third, 3DGS often suffers from an over-reconstruction issue during Gaussian densification, leading to suboptimal 3D scene representations and undesirable scene rendering with artifacts and blurred details. 
Fourth, 3DGS often comes with a large model size due to a large number of parameterized primitives required for explicit scene representations. While anchor-based 3DGS reduces 3D Gaussian redundancy, it often encounters the dilemma among anchor feature dimensions, model size and rendering quality. Large anchor feature size facilitates high-quality rendering but increases the model size due to numerous anchor points used in scene representation, whereas reducing feature size hinders accurate Gaussian prediction and leads to artifacts in rendered textures and structures. One significant challenge is thus to achieve high-quality scene representation and rendering with compact model size. In this thesis, we propose several innovative NeRF and 3DGS techniques that address the above issues successfully with superior 3D scene representation and rendering. First, we design a view matching NeRF (VMRF) that achieves superior NeRF representations without priors on camera poses or hand-crafted camera pose distributions. By leveraging unbalanced optimal transport, VMRF establishes feature correspondences between cross-view images to estimate relative camera poses, effectively mitigating reliance on prior pose information and distributions. Second, we propose IR-NeRF, a scene codebook-based implicit pose regularization framework for pose-free NeRF. IR-NeRF first constructs a scene codebook from unposed real images to store scene features and capture the scene-specific camera pose distribution implicitly as priors. It then employs the scene priors as regularization for promoting the robustness of camera pose estimation for real images and further improving the joint optimization of NeRF and camera poses. Third, we propose FreGS, an innovative 3D Gaussian splatting technique that addresses the over-reconstruction issue from frequency space. 
FreGS introduces a novel frequency annealing technique to achieve progressive frequency regularization, enabling coarse-to-fine Gaussian densification. It effectively improves the Gaussian densification, resulting in superior 3DGS-based scene representations and rendering for novel view synthesis. Fourth, we design SOGS, an advanced 3D Gaussian splatting technique that introduces second-order anchors to achieve superior rendering quality with reduced model size simultaneously. SOGS incorporates covariance-based second-order statistics to perform anchor feature augmentation, compensating for the reduced model size and improving the scene representation and rendering quality effectively. Overall, extensive experiments demonstrate that the proposed NeRF-based and 3DGS-based methods have effectively addressed or mitigated the aforementioned issues and achieved superior 3D scene representation and rendering.

  • Research Article
  • Cite Count Icon 5
  • 10.1016/j.neucom.2024.128041
Zero‐Shot 3D Pose Estimation of Unseen Object by Two‐step RGB-D Fusion
  • Jun 10, 2024
  • Neurocomputing
  • Guifang Duan + 5 more


  • Conference Article
  • Cite Count Icon 72
  • 10.1109/cvpr.2005.414
Articulated Pose Estimation in a Learned Smooth Space of Feasible Solutions
  • Jun 20, 2005
  • Tai-Peng Tian + 2 more

A learning-based framework is proposed for estimating human body pose from a single image. Given a differentiable function that maps from pose space to image feature space, the goal is to invert the process: estimate the pose given only image features. The inversion is an ill-posed problem as the inverse mapping is a one-to-many process, hence multiple solutions exist. It is desirable to restrict the solution space to a smaller subset of feasible solutions. The space of feasible solutions may not admit a closed form description. The proposed framework seeks to learn an approximation over such a space using Gaussian Process Latent Variable Modelling. The scaled conjugate gradient method is used to find the best matching pose in the learned space. The formulation allows easy incorporation of various constraints for more accurate pose estimation. The performance of the proposed approach is evaluated in the task of upper-body pose estimation from silhouettes and compared with the Specialized Mapping Architecture. The proposed approach performs better than the latter approach in terms of estimation accuracy with synthetic data and qualitatively better results with real video of humans performing gestures.
