3D object pose estimation is a critical task in many real-world applications, e.g., robotic manipulation and augmented reality. Most existing methods focus on estimating the pose of object instances or categories that have been seen during training. However, in real-world settings it is imperative to estimate the pose of unseen objects without re-training the network. Therefore, we propose a 3D pose estimation method for unseen objects that requires no re-training. Specifically, given the CAD model of the unseen object, a set of template RGB-D images (RGB and depth images) is rendered from different viewpoints. A feature embedding network, named PoseFusion, is then designed to extract scene features. In this network, the RGB and depth images are used to extract texture and geometric features, respectively. A cross-modality alignment module is proposed to suppress the noise present in each single modality, and the aligned texture and geometric features are fused through a geometry-guided fusion module. In this way, PoseFusion abstracts the template RGB-D images rendered from the CAD model into a set of template scene features, and likewise embeds the RGB-D images captured of the unseen object into query scene features. Finally, the query scene features are matched against the template scene features by computing a masked local similarity, and the identity and pose of the unseen object are determined by the most similar template. Experiments on the LINEMOD and T-LESS datasets demonstrate that our method outperforms existing methods and generalizes better to unseen objects. Extensive ablation studies verify the effectiveness of PoseFusion.
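To make the matching step concrete, the sketch below shows one plausible form of masked local similarity: per-location cosine similarities between query and template feature maps, averaged over the region where both object masks overlap, with the highest-scoring template providing the identity and pose. This is a minimal illustration assuming PoseFusion outputs same-sized per-pixel feature maps with binary object masks; the function and variable names are hypothetical and not taken from the paper.

```python
import numpy as np

def masked_local_similarity(query_feat, query_mask, template_feat, template_mask):
    """Average cosine similarity over locations covered by both object masks.

    query_feat / template_feat: (H, W, C) scene feature maps (assumed same size).
    query_mask / template_mask: (H, W) binary object masks.
    """
    # Normalize each per-location feature vector to unit length.
    q = query_feat / (np.linalg.norm(query_feat, axis=-1, keepdims=True) + 1e-8)
    t = template_feat / (np.linalg.norm(template_feat, axis=-1, keepdims=True) + 1e-8)
    # Restrict the comparison to locations belonging to the object in both views.
    joint = (query_mask > 0) & (template_mask > 0)
    if not joint.any():
        return -1.0
    # Local cosine similarities, averaged over the masked region.
    return float((q[joint] * t[joint]).sum(axis=-1).mean())

def match_templates(query_feat, query_mask, templates):
    """Return (object_id, pose) of the most similar rendered template.

    templates: list of dicts with keys 'feat', 'mask', 'obj_id', 'pose',
    one entry per viewpoint rendered from the CAD model.
    """
    best = max(
        templates,
        key=lambda tpl: masked_local_similarity(
            query_feat, query_mask, tpl["feat"], tpl["mask"]),
    )
    return best["obj_id"], best["pose"]
```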