Six-dimensional (6D) pose estimation is an important branch of robotics that enhances the ability of robots to manipulate and grasp objects. The latest research trend in 6D pose estimation is to directly predict the positions of two-dimensional (2D) keypoints from a single red, green, and blue (RGB) image with convolutional neural networks (CNNs), establish correspondences with the three-dimensional (3D) keypoints of the object model, and then recover the 6D pose parameters with the perspective-n-point (PnP) algorithm. Pose estimation from an RGB image currently faces two challenges. On the one hand, an RGB image lacks depth information, making it difficult to directly obtain the corresponding geometric information of the object. On the other hand, when depth information is available, it is difficult to efficiently fuse the features of the RGB image with those of the corresponding depth image. In this paper, we propose a bidirectional depth residual fusion network with a depth prediction (DP) network to estimate the 6D poses of objects (BDR6D). The BDR6D network predicts the depth information of objects from an RGB image, converts the depth information into a point cloud, and extracts and represents its features jointly with the RGB features. Specifically, the RGB image is fed into the BDR6D network, the DP network predicts the depth information of the objects in the image, and the depth map and RGB image are then input into a point cloud network (PCN) and a CNN, respectively, for feature extraction and representation. We build a bidirectional depth residual (BDR) structure so that the CNN and PCN share information during feature extraction, allowing the two networks to exploit each other's local and global information and thereby improve the learned representations. For the keypoint selection stage, we propose an effective 2D keypoint selection method that considers both the appearance and the geometric information of the object of interest. We evaluate the proposed method on three benchmark datasets and compare it with other 6D pose estimation algorithms; the experimental results show that our method outperforms state-of-the-art approaches. Finally, we deploy the proposed method on a Universal Robots UR5 manipulator to grasp and manipulate objects.

Note to Practitioners—The purpose of this paper is to solve the problem of 6D pose estimation for robot grasping. Existing RGB image-based pose estimation approaches face two challenges. On the one hand, a single RGB image lacks depth information, making it difficult to directly obtain the corresponding geometric information of the object. On the other hand, when depth information is available, it is difficult to efficiently fuse the features of the RGB image with those of the corresponding depth image. To solve these problems, we propose a novel network that predicts the depth information of objects from an RGB image and fuses it with the RGB information to estimate the 6D poses of objects. Furthermore, we propose an effective 2D keypoint selection method that considers the appearance and geometric information of the objects of interest.
We evaluate the proposed approach on three benchmark datasets and on the UR5 robot platform and verify that our method is effective.
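
As a rough illustration of the geometric steps in this pipeline (not the BDR6D network itself), the sketch below back-projects a predicted depth map into a point cloud and recovers a 6D pose from 2D-3D keypoint correspondences with OpenCV's PnP solver. The intrinsics, depth map, model keypoints, and pose used here are placeholder values chosen for the demo; in the actual pipeline the depth would come from the DP network and the 2D keypoints from the fused CNN/PCN keypoint head.

```python
# Hypothetical sketch of the geometric stages described in the abstract:
# depth map -> point cloud (input to the PCN branch), and
# 2D/3D keypoint correspondences -> 6D pose via PnP.
# The learned components (DP network, CNN/PCN fusion, keypoint head) are
# replaced by placeholder data; only the standard geometry is shown.
import numpy as np
import cv2


def depth_to_point_cloud(depth, K):
    """Back-project a depth map (H, W), in meters, into an (N, 3) point cloud
    using pinhole camera intrinsics K (3, 3)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.reshape(-1)
    valid = z > 0
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=1)[valid]


def recover_pose_pnp(kp_2d, kp_3d, K):
    """Recover the 6D pose (R, t) from predicted 2D keypoints and the
    corresponding 3D model keypoints with the PnP algorithm."""
    ok, rvec, tvec = cv2.solvePnP(
        kp_3d.astype(np.float64),
        kp_2d.astype(np.float64),
        K.astype(np.float64),
        distCoeffs=None,
        flags=cv2.SOLVEPNP_EPNP,
    )
    R, _ = cv2.Rodrigues(rvec)
    return R, tvec


# Example pinhole intrinsics (placeholder values).
K = np.array([[572.4, 0.0, 325.3],
              [0.0, 573.6, 242.0],
              [0.0, 0.0, 1.0]])

# Placeholder depth map standing in for the DP network's prediction (meters).
depth = np.full((480, 640), 0.8)
cloud = depth_to_point_cloud(depth, K)  # (N, 3) points fed to the PCN branch

# Placeholder 3D model keypoints (corners of a 10 cm cube) and a synthetic
# ground-truth pose, used only to generate consistent 2D keypoints for the demo.
kp_3d = 0.05 * np.array(
    [[x, y, z] for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)],
    dtype=np.float64,
)
rvec_gt = np.array([0.1, 0.2, 0.3])
tvec_gt = np.array([0.0, 0.0, 0.8])
kp_2d, _ = cv2.projectPoints(kp_3d, rvec_gt, tvec_gt, K, None)
kp_2d = kp_2d.reshape(-1, 2)

# Recover the pose from the 2D-3D correspondences.
R, t = recover_pose_pnp(kp_2d, kp_3d, K)
print("recovered rotation:\n", R)
print("recovered translation:", t.ravel())
```

In the full method, accurate 2D keypoints are what make this last PnP step reliable, which is why the paper pairs the depth-aware BDR feature fusion with a keypoint selection scheme that accounts for both appearance and geometry.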