Abstract

Recent studies have shown that deep learning achieves superior results in estimating the 6D-pose of a target object from an image. End-to-end techniques use deep networks to predict the pose directly from the image, avoiding the limitations of handcrafted features, but they rely on the training dataset to deal with occlusion. Two-stage algorithms alleviate this problem by finding keypoints in the image and then solving the Perspective-n-Point (PnP) problem, avoiding a direct fit of the transformation from image space to 6D-pose space. This paper proposes a novel two-stage method that uses only local features for pixel voting, called the Region Pixel Voting Network (RPVNet). A front-end network detects the target object and predicts its direction maps, from which the keypoints are recovered by pixel voting using Random Sample Consensus (RANSAC). The backbone, object detection network and mask prediction network of RPVNet are designed based on Mask R-CNN. A direction map is a vector field in which the direction at each point points toward its source keypoint. It is shown that predicting an object’s keypoints depends only on the object’s own pixels and is independent of other pixels, which means the influence of occlusion decreases within the object’s region. Based on this observation, RPVNet computes direction maps with a well-designed Convolutional Neural Network (CNN) from local features rather than the whole feature map, i.e., the output of the backbone. The local features are extracted from the whole feature map through RoIAlign, based on the region provided by the detection network. Experiments on the LINEMOD dataset show that RPVNet’s average accuracy (86.1%) is almost equal to the state of the art (86.4%) when no occlusion occurs. Meanwhile, results on the Occlusion LINEMOD dataset show that RPVNet outperforms the state of the art (43.7% vs. 40.8%) and is more accurate for small objects in occluded scenes.
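To make the pixel-voting step concrete, the Python sketch below (not the authors' implementation; the function name, hypothesis count and inlier threshold are assumptions made for this illustration) recovers a single 2D keypoint from a predicted direction map with RANSAC-style voting: random pairs of foreground pixels propose a keypoint by intersecting their predicted rays, and the hypothesis supported by the most pixels is kept.

    import numpy as np

    def vote_keypoint(pixels, directions, n_hypotheses=128, inlier_cos=0.99, rng=None):
        """pixels: (N, 2) coordinates of pixels inside the object region.
        directions: (N, 2) unit vectors predicted by the network, each pointing
        from its pixel toward the keypoint."""
        rng = np.random.default_rng(0) if rng is None else rng
        best_kp, best_inliers = None, -1
        for _ in range(n_hypotheses):
            i, j = rng.choice(len(pixels), size=2, replace=False)
            # Intersect the rays p_i + t*d_i and p_j + s*d_j; skip near-parallel pairs.
            A = np.stack([directions[i], -directions[j]], axis=1)
            if abs(np.linalg.det(A)) < 1e-6:
                continue
            t, _ = np.linalg.solve(A, pixels[j] - pixels[i])
            hyp = pixels[i] + t * directions[i]
            # A pixel is an inlier if its predicted direction points toward the hypothesis.
            to_hyp = hyp - pixels
            to_hyp /= np.linalg.norm(to_hyp, axis=1, keepdims=True) + 1e-8
            inliers = np.sum(np.sum(to_hyp * directions, axis=1) > inlier_cos)
            if inliers > best_inliers:
                best_kp, best_inliers = hyp, inliers
        return best_kp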

Highlights

  • Estimating the accurate 6D-Pose of certain objects has important implications in industries such as e-commerce and logistics [1], as pose information helps robotic systems to better manipulate materials

  • LINEMOD has an average of 1200 images per type of object, but only 150 images of each type are selected for training

  • Only 40–60% of the target object’s area is visible after cropping, which results in a very large proportion of missing pixels. Although this phenomenon rarely occurs in practice and this paper is aimed at improving the accuracy of pose estimation when occlusion happens, we still conducted experiments on Truncation LINEMOD as an additional exploration

Summary

Introduction

Estimating the accurate 6D-pose of certain objects has important implications in industries such as e-commerce and logistics [1], as pose information helps robotic systems to better manipulate materials. The monocular camera is one of the most commonly used sensors in pose estimation, due to its low cost, rich information and easy installation. Since the 3D coordinates of spatial points cannot be obtained directly by a monocular camera, estimating the 6D-pose of an object from an image remains a great research challenge. The correspondence between points on the model and points in the image must first be determined. Traditional methods use handcrafted features [3,4,5,6] to detect the model’s corresponding keypoints in the image. Typical template-based methods [7] extract prior templates from images of objects taken in various poses and estimate the poses of objects in an image by matching these templates. Similar to handcrafted features, such templates are sensitive to environmental changes.
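Once 2D keypoints have been matched to their known 3D model coordinates, the pose follows by solving the PnP problem. The sketch below uses OpenCV's cv2.solvePnP; the keypoint coordinates and camera intrinsics are illustrative example values, not data from the paper.

    import numpy as np
    import cv2

    # Hypothetical 3D keypoints on the object model (object frame, metres).
    object_points = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.0, 0.1, 0.0],
                              [0.0, 0.0, 0.1], [0.1, 0.1, 0.0], [0.1, 0.0, 0.1]])

    # Their 2D detections in the image (pixels), e.g. recovered by pixel voting.
    image_points = np.array([[320.0, 240.0], [400.0, 238.0], [322.0, 170.0],
                             [318.0, 305.0], [402.0, 168.0], [398.0, 300.0]])

    # Pinhole camera intrinsics; no lens distortion assumed.
    K = np.array([[572.4, 0.0, 325.3],
                  [0.0, 573.6, 242.0],
                  [0.0, 0.0, 1.0]])

    ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, None,
                                  flags=cv2.SOLVEPNP_EPNP)
    R, _ = cv2.Rodrigues(rvec)       # 3x3 rotation matrix
    print(R, tvec.ravel())           # object-to-camera rotation and translation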
