Abstract

Recent studies have shown that deep learning achieves superior results in estimating the 6D-pose of a target object from an image. End-to-end techniques use deep networks to predict the pose directly from the image, avoiding the limitations of handcrafted features, but they rely on the training dataset to deal with occlusion. Two-stage algorithms alleviate this problem by finding keypoints in the image and then solving the Perspective-n-Point (PnP) problem, avoiding a direct fit of the transformation from image space to 6D-pose space. This paper proposes a novel two-stage method that uses only local features for pixel voting, called the Region Pixel Voting Network (RPVNet). A front-end network detects the target object and predicts its direction maps, from which the keypoints are recovered by pixel voting using Random Sample Consensus (RANSAC). The backbone, object detection network and mask prediction network of RPVNet are designed based on Mask R-CNN. A direction map is a vector field in which the direction at each point points toward its source keypoint. It is shown that predicting an object’s keypoints depends only on the object’s own pixels and is independent of other pixels, which means the influence of occlusion decreases within the object’s region. Based on this observation, RPVNet computes direction maps with a well-designed Convolutional Neural Network (CNN) from local features rather than the whole feature map, i.e., the output of the backbone. The local features are extracted from the whole feature map through RoIAlign, based on the region provided by the detection network. Experiments on the LINEMOD dataset show that RPVNet’s average accuracy (86.1%) is almost equal to the state of the art (86.4%) when no occlusion occurs. Meanwhile, results on the Occlusion LINEMOD dataset show that RPVNet outperforms the state of the art (43.7% vs. 40.8%) and is more accurate for small objects in occluded scenes.
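To make the pixel-voting step concrete, the Python sketch below (not the authors' implementation; the function name, hypothesis count and inlier threshold are assumptions made for this illustration) recovers a single 2D keypoint from a predicted direction map with RANSAC-style voting: random pairs of foreground pixels propose a keypoint by intersecting their predicted rays, and the hypothesis supported by the most pixels is kept.

    import numpy as np

    def vote_keypoint(pixels, directions, n_hypotheses=128, inlier_cos=0.99, rng=None):
        """pixels: (N, 2) coordinates of pixels inside the object region.
        directions: (N, 2) unit vectors predicted by the network, each pointing
        from its pixel toward the keypoint."""
        rng = np.random.default_rng(0) if rng is None else rng
        best_kp, best_inliers = None, -1
        for _ in range(n_hypotheses):
            i, j = rng.choice(len(pixels), size=2, replace=False)
            # Intersect the rays p_i + t*d_i and p_j + s*d_j; skip near-parallel pairs.
            A = np.stack([directions[i], -directions[j]], axis=1)
            if abs(np.linalg.det(A)) < 1e-6:
                continue
            t, _ = np.linalg.solve(A, pixels[j] - pixels[i])
            hyp = pixels[i] + t * directions[i]
            # A pixel is an inlier if its predicted direction points toward the hypothesis.
            to_hyp = hyp - pixels
            to_hyp /= np.linalg.norm(to_hyp, axis=1, keepdims=True) + 1e-8
            inliers = np.sum(np.sum(to_hyp * directions, axis=1) > inlier_cos)
            if inliers > best_inliers:
                best_kp, best_inliers = hyp, inliers
        return best_kp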

Highlights

  • Estimating the accurate 6D-Pose of certain objects has important implications in industries such as e-commerce and logistics [1], as pose information helps robotic systems to better manipulate materials

  • LINEMOD has an average of 1200 images per type of object, but only 150 images of each type are selected for training

  • Only 40–60% of the target object’s area is visible after cropping, which results in a very large proportion of missing pixels. Although this phenomenon rarely occurs in practice and this paper is aimed at improving the accuracy of pose estimation when occlusion happens, we still conducted experiments on Truncation LINEMOD as an additional exploration

Summary

Introduction

Estimating the accurate 6D-pose of certain objects has important implications in industries such as e-commerce and logistics [1], as pose information helps robotic systems to better manipulate materials. The monocular camera is one of the most commonly used sensors in pose estimation, due to its low cost, rich information and easy installation. Since the 3D coordinates of spatial points cannot be obtained directly by a monocular camera, estimating the 6D-pose of an object from an image remains a great research challenge. The correspondence between points on the model and points in the image must first be determined. Traditional methods use handcrafted features [3,4,5,6] to detect the model’s corresponding keypoints in the image. Typical template-based methods [7] extract prior templates from images of objects taken in various poses and estimate the poses of objects in an image by matching these templates. Similar to handcrafted features, such templates are sensitive to environmental changes.
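Once 2D keypoints have been matched to their known 3D model coordinates, the pose follows by solving the PnP problem. The sketch below uses OpenCV's cv2.solvePnP; the keypoint coordinates and camera intrinsics are illustrative example values, not data from the paper.

    import numpy as np
    import cv2

    # Hypothetical 3D keypoints on the object model (object frame, metres).
    object_points = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.0, 0.1, 0.0],
                              [0.0, 0.0, 0.1], [0.1, 0.1, 0.0], [0.1, 0.0, 0.1]])

    # Their 2D detections in the image (pixels), e.g. recovered by pixel voting.
    image_points = np.array([[320.0, 240.0], [400.0, 238.0], [322.0, 170.0],
                             [318.0, 305.0], [402.0, 168.0], [398.0, 300.0]])

    # Pinhole camera intrinsics; no lens distortion assumed.
    K = np.array([[572.4, 0.0, 325.3],
                  [0.0, 573.6, 242.0],
                  [0.0, 0.0, 1.0]])

    ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, None,
                                  flags=cv2.SOLVEPNP_EPNP)
    R, _ = cv2.Rodrigues(rvec)       # 3x3 rotation matrix
    print(R, tvec.ravel())           # object-to-camera rotation and translation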
