Abstract

Deep learning methods for 6D object pose estimation based on RGB and depth (RGB-D) images have been successfully applied to robotic manipulation and grasping. In these approaches, how to fuse the RGB and depth modalities is one of the most critical issues. Most existing works perform fusion via either simple concatenation or element-wise multiplication of the features generated by the two modalities. Despite achieving impressive progress, such fusion strategies do not explicitly consider the different contributions of the RGB and depth modalities, leaving room for performance improvement. In this paper, we present a Cross-Modal Attention (CMA) component for 6D object pose estimation. With the attention mechanism, features of the two modalities are aggregated adaptively through learned attention weights, so that powerful representations can be efficiently extracted from RGB-D images. Comprehensive experiments on both the LINEMOD and YCB-Video datasets demonstrate that the proposed approach achieves state-of-the-art performance.
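
The abstract does not describe the internals of the CMA component, so the following PyTorch sketch is only one plausible reading of "aggregating features of the two modalities adaptively through attention weights": a small convolutional sub-network predicts per-pixel softmax weights for the RGB and depth branches, which replace plain concatenation or element-wise multiplication. The class name, layer sizes, and weighting scheme are assumptions for illustration, not the paper's actual architecture.

    import torch
    import torch.nn as nn

    class CrossModalAttentionFusion(nn.Module):
        """Illustrative adaptive RGB-D fusion (hypothetical; not the paper's exact CMA)."""

        def __init__(self, channels: int):
            super().__init__()
            # Predict one attention logit per modality at each spatial location
            # from the concatenated RGB and depth features.
            self.attn = nn.Sequential(
                nn.Conv2d(2 * channels, channels, kernel_size=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, 2, kernel_size=1),  # two logits: RGB, depth
            )

        def forward(self, rgb_feat: torch.Tensor, depth_feat: torch.Tensor) -> torch.Tensor:
            # rgb_feat, depth_feat: (B, C, H, W) feature maps from the two branches.
            logits = self.attn(torch.cat([rgb_feat, depth_feat], dim=1))
            weights = torch.softmax(logits, dim=1)  # (B, 2, H, W), sums to 1 per pixel
            # Adaptive weighted sum: each location decides how much to trust
            # appearance (RGB) versus geometry (depth).
            return weights[:, 0:1] * rgb_feat + weights[:, 1:2] * depth_feat

    # Toy usage with random features standing in for backbone outputs.
    fuse = CrossModalAttentionFusion(channels=64)
    rgb = torch.randn(2, 64, 32, 32)
    depth = torch.randn(2, 64, 32, 32)
    fused = fuse(rgb, depth)  # shape: (2, 64, 32, 32)

Normalizing the two weights with a per-pixel softmax is one simple way to make the contributions of the modalities explicit and adaptive, which is exactly the gap the abstract attributes to concatenation- and multiplication-based fusion.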
