Abstract

Estimating the 6D pose of object instances in the crowd (scenes with multiple object instances, severe foreground occlusions, and background distractors) has been a research hotspot in recent years because such scenes are very common in industrial applications. In this work, we present a segmentation-driven approach to recover the 6D object pose in the crowd. First, the convolutional neural network framework Mask R-CNN is applied to obtain masks and bounding boxes of the target object instances from the scene image. Then, the bounding boxes are divided into smaller patches with sliding windows. Next, a sparse autoencoder is employed to extract invariant features from these patches, and several rough candidate poses are obtained by Hough voting. Finally, the Iterative Closest Point (ICP) method is used to refine the 6D object pose. We tested our approach on the commonly used LINEMOD dataset [1]. Experimental results show that our approach achieves high accuracy and is robust to foreground occlusions and background distractors.
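As an illustration of the patch-extraction step only, the following is a minimal sketch of dividing a bounding box into patches with a sliding window. The function name `extract_patches`, the `(x0, y0, x1, y1)` box convention, and the `patch_size` and `stride` values are assumptions made for this example; the abstract does not state the settings actually used.

```python
import numpy as np


def extract_patches(image, box, patch_size=32, stride=16):
    """Slide a square window over a bounding box and collect patches.

    image: H x W (x C) array.
    box: (x0, y0, x1, y1) in pixels, with x1 and y1 exclusive (assumed convention).
    patch_size, stride: illustrative values, not taken from the paper.
    Returns the patches and the (x, y) center of each patch.
    """
    x0, y0, x1, y1 = box
    patches, centers = [], []
    # Only windows that fit entirely inside the box are kept.
    for y in range(y0, y1 - patch_size + 1, stride):
        for x in range(x0, x1 - patch_size + 1, stride):
            patches.append(image[y:y + patch_size, x:x + patch_size])
            centers.append((x + patch_size // 2, y + patch_size // 2))
    return patches, centers


if __name__ == "__main__":
    scene = np.zeros((100, 100), dtype=np.float32)
    patches, centers = extract_patches(scene, (10, 10, 74, 74))
    print(len(patches), patches[0].shape, centers[0])
```

In a pipeline like the one described, each patch would then be passed to the feature extractor, and the patch centers would give the image locations from which votes are cast.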
