The goal of six-dimensional (6D) pose estimation is to accurately determine the position and orientation of an object in three-dimensional space, a task with a wide range of applications in artificial intelligence. Because the point cloud data captured by depth cameras is relatively sparse, models struggle to fully capture the shape, structure, and other features of an object, and consequently generalize poorly to objects whose shapes differ significantly from those seen in training. Deep fusion across feature levels and the mining of both local and global information can effectively alleviate these issues. To this end, we propose a Two-Stage Geometric Neighborhood Fusion Network (TGNF-Net) for category-level 6D pose estimation of objects not seen during training. TGNF-Net strengthens the fusion of feature points within a local neighborhood, making them more sensitive to both local and global geometric information. Our approach includes a neighborhood information fusion module, which exploits neighborhood information to enrich the features of each modality and overcome the heterogeneity between image and point cloud data. In addition, we design a two-stage geometric information embedding module, which fuses multi-scale geometric information into keypoint features; this enhances the robustness of the model and enables stronger generalization to unknown or complex scenes. Together, these two strategies enrich feature representations and make NOCS coordinate predictions more accurate. Extensive experiments show that our approach outperforms other classical methods on the CAMERA25, REAL275, HouseCat6D, and Omni6DPose datasets.
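To make the neighborhood-fusion idea concrete, the sketch below shows one plausible form of it in PyTorch: for each 3D point, the image and point cloud features of its k nearest neighbors are aggregated so that every point encodes local context from both modalities. This is a minimal illustration under our own assumptions, not the authors' released implementation; the class name `NeighborhoodFusion`, the neighborhood size `k`, and the max-pool aggregation are all hypothetical choices.

```python
import torch
import torch.nn as nn


class NeighborhoodFusion(nn.Module):
    """Illustrative k-NN fusion of per-point image and geometric features
    (a sketch of the general technique, not the TGNF-Net code)."""

    def __init__(self, img_dim, pts_dim, out_dim, k=16):
        super().__init__()
        self.k = k
        self.mlp = nn.Sequential(
            nn.Linear(img_dim + pts_dim, out_dim),
            nn.ReLU(inplace=True),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, xyz, img_feat, pts_feat):
        # xyz: (B, N, 3) point coordinates
        # img_feat: (B, N, Ci) image features already aligned to the points
        # pts_feat: (B, N, Cp) geometric features of the points
        dist = torch.cdist(xyz, xyz)                     # (B, N, N) pairwise distances
        idx = dist.topk(self.k, largest=False).indices   # (B, N, k) k-NN indices
        fused = torch.cat([img_feat, pts_feat], dim=-1)  # (B, N, Ci+Cp)
        B, N, C = fused.shape
        # Gather the fused features of each point's k neighbors: (B, N, k, C)
        nbr = torch.gather(
            fused.unsqueeze(1).expand(B, N, N, C),
            2,
            idx.unsqueeze(-1).expand(B, N, self.k, C),
        )
        # Max-pool over the neighborhood, then project to the output dimension
        pooled = nbr.max(dim=2).values                   # (B, N, C)
        return self.mlp(pooled)                          # (B, N, out_dim)


# Usage on random data (shapes only, for illustration):
# fusion = NeighborhoodFusion(img_dim=32, pts_dim=64, out_dim=128, k=16)
# out = fusion(torch.rand(2, 1024, 3), torch.rand(2, 1024, 32), torch.rand(2, 1024, 64))
```

Pooling over a spatial neighborhood rather than per-point concatenation is what lets each feature point respond to local geometric structure; the global context described in the abstract would come from additional, coarser-scale stages.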