Abstract

Estimating the 6-DoF (degree-of-freedom) object pose from a single RGB image is one of the most challenging tasks in computer vision. Before the pose, defined as the translation and rotation parameters, can be derived by the traditional PnP algorithm, the 2D image projections of a set of 3D object keypoints must be accurately detected. In this paper, we present techniques for defining 3D object surface keypoints and predicting their corresponding 2D counterparts via deep-learning network architectures. The main technique for designating 3D object keypoints is to employ a quadratic-fitting scheme to compute the principal surface curvatures as weights, and then to select from all surface points those that are widely distributed and have larger curvatures, so as to describe the object shape as well as possible. The 2D projected keypoints are not regressed directly by the network; instead, they are encoded as unit vector fields pointing toward them, so that a voting scheme can be applied to recover those 2D keypoints. Moreover, an effective loss function with a regularization term that focuses on small-scale errors is adopted when training a ResNet to predict the image projections of the object keypoints. Experimental results show that our proposed technique outperforms state-of-the-art approaches in both the “2D projection” and “3D transformation” metrics.
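
A minimal sketch, in Python/NumPy, of the curvature-weighted keypoint selection idea described above: a quadratic surface is fitted to each point's local neighborhood to estimate principal curvatures, and keypoints are then picked by a curvature-weighted farthest-point sampling so that they are both spread out and located on high-curvature regions. The neighborhood size, weighting, and helper names are illustrative assumptions, not the paper's exact scheme.

    import numpy as np

    def principal_curvatures(points, k=20):
        """Estimate principal curvatures at each point by fitting a local
        quadratic surface z = a*u^2 + b*u*v + c*v^2 to its k nearest
        neighbors in a local tangent frame (brute-force neighbors for clarity)."""
        curvatures = np.zeros((len(points), 2))
        for i, p in enumerate(points):
            d = np.linalg.norm(points - p, axis=1)
            nbrs = points[np.argsort(d)[1:k + 1]] - p            # centered neighbors
            # local frame from SVD: two tangent axes + normal (least-variance axis)
            _, _, vt = np.linalg.svd(nbrs, full_matrices=False)
            uv = nbrs @ vt[:2].T                                 # tangent coordinates
            w = nbrs @ vt[2]                                     # height along normal
            # least-squares quadratic fit of the local height field
            A = np.stack([uv[:, 0] ** 2, uv[:, 0] * uv[:, 1], uv[:, 1] ** 2], axis=1)
            a, b, c = np.linalg.lstsq(A, w, rcond=None)[0]
            H = np.array([[2 * a, b], [b, 2 * c]])               # Hessian at the origin
            curvatures[i] = np.linalg.eigvalsh(H)                # principal curvatures
        return curvatures

    def curvature_weighted_fps(points, weights, num_keypoints=8):
        """Farthest-point sampling with distances scaled by a curvature weight,
        so selected keypoints are spread out and lie on high-curvature regions."""
        chosen = [int(np.argmax(weights))]                       # start at max curvature
        dist = np.full(len(points), np.inf)
        for _ in range(num_keypoints - 1):
            dist = np.minimum(dist, np.linalg.norm(points - points[chosen[-1]], axis=1))
            chosen.append(int(np.argmax(weights * dist)))        # weighted farthest point
        return np.array(chosen)

    # usage with placeholder model points: 8 surface keypoints plus the centroid
    pts = np.random.rand(2000, 3)
    curv = principal_curvatures(pts)
    weight = np.abs(curv).sum(axis=1)                            # e.g. |k1| + |k2| as weight
    surface_kps = pts[curvature_weighted_fps(pts, weight, num_keypoints=8)]
    keypoints = np.vstack([surface_kps, pts.mean(axis=0)])       # append the centroid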

Highlights

  • 6-DoF object pose estimation identifies the orientation and translation of the target object, in order to better understand the 3D scene

  • In Curvature Point Sampling (CPS), we set the parameter K = 9, giving a total of 9 object surface keypoints, including one at the object centroid

  • In choosing the similarity threshold τ, if it is set too low, the voting quality (i.e., the accuracy of the unit vectors) will be low; if it is set too high, there is a risk of having too few supporting voters when applying least-squares fitting (see the sketch after this list)
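
Below is a minimal sketch of how the similarity threshold τ (tau in the code) might enter a PVNet-style voting step for a single 2D keypoint: pairs of pixels generate keypoint hypotheses, pixels whose predicted unit vectors agree with a hypothesis above tau count as supporting voters, and the best-supported hypothesis is refined by least squares over those voters. The function names and hypothesis-generation details are assumptions for illustration, not the authors' exact implementation.

    import numpy as np

    def intersect_two_rays(p1, v1, p2, v2):
        """Least-squares intersection of two 2D rays: a keypoint hypothesis."""
        A = np.vstack([np.eye(2) - np.outer(v1, v1), np.eye(2) - np.outer(v2, v2)])
        b = np.concatenate([(np.eye(2) - np.outer(v1, v1)) @ p1,
                            (np.eye(2) - np.outer(v2, v2)) @ p2])
        return np.linalg.lstsq(A, b, rcond=None)[0]

    def vote_and_refine(pixels, vectors, tau=0.99, num_hypotheses=128, rng=None):
        """RANSAC-style voting: pixels whose predicted unit vectors agree with a
        hypothesis above the similarity threshold tau are supporting voters; the
        best-supported hypothesis is refined by least squares over its voters."""
        rng = np.random.default_rng(rng)
        best_inliers = None
        for _ in range(num_hypotheses):
            i, j = rng.choice(len(pixels), size=2, replace=False)
            h = intersect_two_rays(pixels[i], vectors[i], pixels[j], vectors[j])
            to_h = h - pixels
            to_h /= np.linalg.norm(to_h, axis=1, keepdims=True) + 1e-9
            inliers = np.einsum('ij,ij->i', to_h, vectors) >= tau   # cosine-similarity test
            if best_inliers is None or inliers.sum() > best_inliers.sum():
                best_inliers = inliers
        # least-squares fit over the supporting voters:
        # minimize sum_i || (I - v_i v_i^T) (x - p_i) ||^2
        P, V = pixels[best_inliers], vectors[best_inliers]
        M = np.eye(2)[None] - V[:, :, None] * V[:, None, :]         # (I - v v^T) per voter
        return np.linalg.solve(M.sum(axis=0), np.einsum('nij,nj->i', M, P))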

Summary

Introduction

6-DoF object pose estimation is the process of identifying the orientation and translation of the target object (with respect to a pre-built 3D model) in order to better understand the 3D scene. One of the efficient deep-neural-network approaches, which trains a network to estimate the eight projected corners of an object’s 3D bounding box in a 2D image, was first proposed in [8], [9]. Those estimated 2D projected corners were then aligned with the corresponding 3D corners of the target object model by operating a PnP solver. Keypoints [10] on the object surface, in contrast to the bounding-box corners, are more informative for forming a feature point set that describes the object shape.
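
Once the 2D keypoints have been recovered, the pose follows from a standard PnP solver. A minimal sketch with OpenCV is given below; the 3D keypoints, their 2D projections, and the camera intrinsics are placeholders, not values from the paper.

    import numpy as np
    import cv2

    # placeholder inputs: in practice the 3D keypoints come from the object model
    # and the 2D keypoints from the voting step on the network's vector fields
    object_points = np.random.rand(9, 3).astype(np.float64)        # 3D keypoints (model frame)
    image_points = np.random.rand(9, 2).astype(np.float64) * 480   # recovered 2D keypoints
    camera_matrix = np.array([[572.4, 0.0, 325.3],
                              [0.0, 573.6, 242.0],
                              [0.0, 0.0, 1.0]])                    # illustrative intrinsics
    dist_coeffs = np.zeros((5, 1))                                 # assume no lens distortion

    # EPnP recovers the rotation and translation from the 2D-3D correspondences
    ok, rvec, tvec = cv2.solvePnP(object_points, image_points,
                                  camera_matrix, dist_coeffs,
                                  flags=cv2.SOLVEPNP_EPNP)
    R, _ = cv2.Rodrigues(rvec)                                      # 3x3 rotation matrix
    print("rotation:\n", R, "\ntranslation:\n", tvec.ravel())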
