Abstract

6D object pose estimation from RGB-D images has achieved excellent performance in recent years. Since RGB-D images contain both RGB and depth data, learning a comprehensive representation from these two modalities remains a key challenge for accurate pose estimation. Many existing works integrate RGB and depth information through simple concatenation or element-wise multiplication at the pixel or feature level, ignoring the interaction between the two modalities. To address this problem, in this paper we adopt the self-attention mechanism to model the relationship between the modalities and propose a mutual attention fusion (MAF) block that lets features from the two modalities interact, producing a concise and robust RGB-D representation. Comprehensive experiments on the LineMOD and YCB-Video datasets demonstrate that the proposed approach achieves superior performance over previous works, while remaining efficient and easy to use.
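To illustrate the idea of fusing RGB and depth features through mutual attention, the following is a minimal sketch of a bidirectional cross-attention block, assuming per-point feature sequences from each modality. The class name, dimensions, and layer choices are hypothetical illustrations of the general technique, not the paper's actual MAF block.

```python
# Hypothetical sketch (not the paper's released code): RGB features attend to
# depth features and vice versa; the two interacted streams are then
# concatenated and projected into a single fused RGB-D representation.
import torch
import torch.nn as nn


class MutualAttentionFusion(nn.Module):
    """Bidirectional cross-attention between RGB and depth feature sequences."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # RGB queries attend to depth keys/values, and the reverse.
        self.rgb_to_depth = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.depth_to_rgb = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_rgb = nn.LayerNorm(dim)
        self.norm_depth = nn.LayerNorm(dim)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, f_rgb: torch.Tensor, f_depth: torch.Tensor) -> torch.Tensor:
        # f_rgb, f_depth: (batch, num_points, dim) per-pixel/per-point features.
        rgb_att, _ = self.rgb_to_depth(f_rgb, f_depth, f_depth)
        depth_att, _ = self.depth_to_rgb(f_depth, f_rgb, f_rgb)
        rgb_out = self.norm_rgb(f_rgb + rgb_att)        # residual + layer norm
        depth_out = self.norm_depth(f_depth + depth_att)
        # Concatenate the interacted streams and project to the fused feature.
        return self.fuse(torch.cat([rgb_out, depth_out], dim=-1))


# Example usage: fuse 512 per-point features of dimension 256 per modality.
if __name__ == "__main__":
    maf = MutualAttentionFusion(dim=256, num_heads=4)
    rgb = torch.randn(2, 512, 256)
    depth = torch.randn(2, 512, 256)
    fused = maf(rgb, depth)
    print(fused.shape)  # torch.Size([2, 512, 256])
```

Unlike plain concatenation or element-wise multiplication, each modality's features here are re-weighted by attention over the other modality before fusion, which is the kind of cross-modal interaction the abstract describes.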
