Abstract

6D object pose estimation from RGB-D images has achieved excellent performance in recent years. Since RGB-D images contain both RGB and depth data, learning a comprehensive representation from these two modalities remains a key challenge for accurate pose estimation. Many existing works integrate RGB and depth information through simple concatenation or element-wise multiplication at the pixel or feature level, ignoring the interaction between the two modalities. To address this problem, in this paper we adopt the self-attention mechanism to model the relationship between the modalities and propose a mutual attention fusion (MAF) block that lets features from the two modalities interact, producing a concise and robust RGB-D representation. Comprehensive experiments on the LineMOD and YCB-Video datasets demonstrate that the proposed approach achieves superior performance over previous works, while remaining efficient and easy to use.
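To illustrate the idea of fusing RGB and depth features through mutual attention, the following is a minimal sketch of a bidirectional cross-attention block, assuming per-point feature sequences from each modality. The class name, dimensions, and layer choices are hypothetical illustrations of the general technique, not the paper's actual MAF block.

```python
# Hypothetical sketch (not the paper's released code): RGB features attend to
# depth features and vice versa; the two interacted streams are then
# concatenated and projected into a single fused RGB-D representation.
import torch
import torch.nn as nn


class MutualAttentionFusion(nn.Module):
    """Bidirectional cross-attention between RGB and depth feature sequences."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # RGB queries attend to depth keys/values, and the reverse.
        self.rgb_to_depth = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.depth_to_rgb = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_rgb = nn.LayerNorm(dim)
        self.norm_depth = nn.LayerNorm(dim)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, f_rgb: torch.Tensor, f_depth: torch.Tensor) -> torch.Tensor:
        # f_rgb, f_depth: (batch, num_points, dim) per-pixel/per-point features.
        rgb_att, _ = self.rgb_to_depth(f_rgb, f_depth, f_depth)
        depth_att, _ = self.depth_to_rgb(f_depth, f_rgb, f_rgb)
        rgb_out = self.norm_rgb(f_rgb + rgb_att)        # residual + layer norm
        depth_out = self.norm_depth(f_depth + depth_att)
        # Concatenate the interacted streams and project to the fused feature.
        return self.fuse(torch.cat([rgb_out, depth_out], dim=-1))


# Example usage: fuse 512 per-point features of dimension 256 per modality.
if __name__ == "__main__":
    maf = MutualAttentionFusion(dim=256, num_heads=4)
    rgb = torch.randn(2, 512, 256)
    depth = torch.randn(2, 512, 256)
    fused = maf(rgb, depth)
    print(fused.shape)  # torch.Size([2, 512, 256])
```

Unlike plain concatenation or element-wise multiplication, each modality's features here are re-weighted by attention over the other modality before fusion, which is the kind of cross-modal interaction the abstract describes.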
