Abstract

The vision and robotics communities have developed different methods for object pose estimation, each with its own advantages and disadvantages. A popular approach stores, offline in a database, model images of each object captured from many viewpoints together with their 2D-to-3D correspondences. At run time, local feature matching is applied between the current view and the model images in the database, and for the best-matching image the object pose is estimated with a PnP algorithm inside a RANSAC loop. Such a method achieves good accuracy but lacks efficiency, consuming O(MN²) time, where N is the number of features per model image and M is the number of model images. To tackle this problem, we propose a method that improves efficiency in two ways. First, we employ hierarchical clustering to select a proper number of model images to represent each object, decreasing M. Second, we propose a coarse-to-fine object pose estimation method that decreases the time needed to find the best-matching model image. In the coarse step, given an input image, the most similar model image is retrieved using a global image descriptor, which we compute with a pre-trained deep neural network. In the fine step, a local descriptor matching method finds corresponding keypoints between the current image and the model image retrieved in the coarse step. Finally, using the pre-registered 2D-to-3D correspondences of that model image, an accurate object pose is computed with the PnP and RANSAC approach. The performance of our method is evaluated on the Amazon Picking Challenge dataset.
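The coarse retrieval step described above can be sketched as a nearest-neighbor search over global descriptors. The sketch below is illustrative only and is not the paper's implementation: function and variable names are assumptions, and the descriptors here are toy vectors standing in for CNN features. The fine step (local keypoint matching followed by PnP with RANSAC, e.g. OpenCV's `cv2.solvePnPRansac`) would then run only against the single retrieved model image.

```python
import numpy as np

def retrieve_model_image(query_desc, model_descs):
    """Coarse step (illustrative): return the index of the model image
    whose global descriptor has the highest cosine similarity with the
    query's descriptor. In the paper the descriptors come from a
    pre-trained deep network; here they are toy vectors."""
    q = query_desc / np.linalg.norm(query_desc)
    M = model_descs / np.linalg.norm(model_descs, axis=1, keepdims=True)
    # One matrix-vector product scores all M model images at once,
    # so retrieval is O(M * d) instead of O(M * N^2) feature matching.
    return int(np.argmax(M @ q))

# Toy example: three model-image descriptors and one query.
model_descs = np.array([[1.0, 0.0],
                        [0.0, 1.0],
                        [0.7, 0.7]])
query_desc = np.array([0.6, 0.8])
best = retrieve_model_image(query_desc, model_descs)  # index of closest model
```

After this retrieval, expensive local feature matching is performed against only the one returned model image rather than all M, which is where the claimed speedup comes from.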
