Purpose
6D pose estimation is a crucial branch of robot vision. However, the authors find that the performance of most existing methods is unsatisfactory because they fail to fully exploit the complementarity of the object's appearance and geometry information, to deeply explore the contributions of features from different regions to pose estimation, and to take advantage of the invariance of the geometric structure of keypoints. This paper aims to design a high-precision 6D pose estimation method based on these insights.

Design/methodology/approach
First, a multi-scale cross-attention-based feature fusion module (MCFF) is designed to aggregate appearance and geometry information by exploring the correlations between appearance features and geometry features in various regions. Second, the authors build a multi-query regional-attention-based feature differentiation module (MRFD) to learn the contribution of each region to each keypoint. Finally, a geometric enhancement mechanism (GEM) is designed to use structural information to predict keypoints and to optimize both the pose and the keypoints in the inference phase.

Findings
Experiments on several benchmarks and a real robot show that the proposed method outperforms existing methods. Ablation studies illustrate the effectiveness of each module of the authors' method.

Originality/value
A high-precision 6D pose estimation method is proposed by studying the relationship between the appearance and geometry of different object parts and the geometric invariance of keypoints, which is of great significance for various robot applications.
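To give a concrete flavor of the kind of cross-attention fusion the MCFF module describes, the sketch below fuses per-point appearance and geometry features with bidirectional cross-attention in PyTorch. It is a minimal illustration under assumed inputs (both feature sets given as batch x points x dim tensors); the class name CrossAttentionFusion, the dimensions, and the single-scale design are hypothetical and do not reflect the authors' actual implementation.

```python
# Minimal sketch (not the authors' code) of cross-attention fusion between
# appearance and geometry features, assuming per-point features of shape (B, N, dim).
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    """Fuse appearance and geometry features via bidirectional cross-attention."""

    def __init__(self, dim: int = 128, num_heads: int = 4):
        super().__init__()
        # Appearance features attend to geometry features, and vice versa.
        self.app_to_geo = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.geo_to_app = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.out_proj = nn.Linear(2 * dim, dim)

    def forward(self, app_feat: torch.Tensor, geo_feat: torch.Tensor) -> torch.Tensor:
        # app_feat, geo_feat: (B, N, dim) features from the RGB and depth branches.
        app_enhanced, _ = self.app_to_geo(query=app_feat, key=geo_feat, value=geo_feat)
        geo_enhanced, _ = self.geo_to_app(query=geo_feat, key=app_feat, value=app_feat)
        # Concatenate the mutually enhanced features and project back to dim.
        fused = torch.cat([app_enhanced, geo_enhanced], dim=-1)
        return self.out_proj(fused)


if __name__ == "__main__":
    fusion = CrossAttentionFusion(dim=128, num_heads=4)
    app = torch.randn(2, 1024, 128)   # appearance features for 1024 sampled points
    geo = torch.randn(2, 1024, 128)   # geometry features for the same points
    print(fusion(app, geo).shape)     # torch.Size([2, 1024, 128])
```

The paper's module additionally operates at multiple scales and across regions; this sketch only shows the core idea of letting each modality query the other so that complementary appearance and geometry cues are aggregated before keypoint prediction.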