This paper addresses the challenge of 6DoF texture-less object pose estimation from a single RGB image. Many recent works have shown that two-stage deep learning approaches based on the fusion of 2D geometric intermediate representations achieve remarkable results. These methods implicitly explore the mapping from the 2D appearance domain to the 3D structure domain. However, lacking the 3D geometric constraints that depth maps provide, appearance features alone offer insufficient cues for learning the projective relation between 3D viewpoints and the 2D image plane, and the estimation process is extremely sensitive to occlusion. We propose a novel network called MLFNet that lifts the feature space from 2D to 3D based on hybrid 3D geometric intermediate representations. For the first time, we propose surface normals in the object coordinate system as an intermediate representation of pose; their sharp variations provide strong cues for keypoints, which are usually located where the object surface changes abruptly. Dense 3D surfaces can enhance the geometric consistency of multi-representation constraints and retain more information in occluded scenes. With the proposed multi-modality dual attention mechanism and the embedding of standard 3D shape knowledge, the 2D geometric representation learning process explicitly depends on the fusion of 2D appearance features and 3D geometric features. This standardized information fusion pattern among 2D intermediate representations, 3D intermediate representations, and CAD model priors significantly reduces the network's learning space. The proposed method achieves competitive performance on the Linemod dataset and outperforms state-of-the-art methods on the Occlusion Linemod and T-Less datasets, demonstrating the feasibility of the pose multi-representation fusion technique. The project site is at https://github.com/JJJano/MLFNet.
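To make the surface-normal intermediate representation concrete, the sketch below shows one standard way such normals can be derived from a CAD mesh already expressed in the object coordinate system: area-weighted averaging of adjacent face normals per vertex. This is a generic illustration using NumPy, not the paper's actual implementation; the function name and interface are assumptions for demonstration.

```python
import numpy as np

def object_frame_normals(vertices, faces):
    """Per-vertex unit surface normals in the object coordinate system.

    vertices: (V, 3) array of mesh vertices in the object frame.
    faces:    (F, 3) array of vertex indices per triangle.
    Returns a (V, 3) array of unit normals, computed as the
    area-weighted average of the normals of adjacent faces.
    """
    v0, v1, v2 = (vertices[faces[:, i]] for i in range(3))
    # The cross product of two triangle edges gives a face normal
    # whose magnitude is twice the triangle's area (area weighting).
    face_n = np.cross(v1 - v0, v2 - v0)
    normals = np.zeros_like(vertices, dtype=float)
    for i in range(3):
        # Accumulate each face normal onto its three vertices.
        np.add.at(normals, faces[:, i], face_n)
    # Normalize to unit length, guarding against degenerate vertices.
    norms = np.linalg.norm(normals, axis=1, keepdims=True)
    return normals / np.clip(norms, 1e-12, None)

# Toy example: a single triangle lying in the z = 0 plane.
verts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
tris = np.array([[0, 1, 2]])
print(object_frame_normals(verts, tris))  # every normal points along +z
```

Because these normals live in the object frame, they are invariant to camera pose, and regions where they change abruptly (edges, corners) coincide with the keypoint locations the abstract describes.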