Abstract

This paper presents the 6D vision transformer (6D-ViT), a transformer-based instance representation learning network for highly accurate category-level object pose estimation from RGB-D images. Specifically, a novel two-stream encoder-decoder framework is designed to learn complex and powerful instance representations from RGB images, point clouds, and categorical shape priors. The framework consists of two main branches, named Pixelformer and Pointformer. Pixelformer combines a pyramid transformer encoder with an all-multilayer perceptron (MLP) decoder to extract pixelwise appearance representations from RGB images, while Pointformer relies on a cascaded transformer encoder and an all-MLP decoder to acquire pointwise geometric characteristics from point clouds. Dense instance representations (i.e., the correspondence matrix and deformation field) for NOCS model reconstruction are then obtained from a multisource aggregation (MSA) network that takes the shape prior, appearance, and geometric information as inputs. Finally, the instance 6D pose is computed by solving the similarity transformation between the observed point cloud and the reconstructed NOCS representation. Extensive experiments on synthetic and real-world datasets demonstrate that the proposed framework achieves state-of-the-art performance on both datasets. Code is available at https://github.com/luzzou/6D-ViT.
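The final step described above, recovering the similarity transformation (scale, rotation, translation) between the observed point cloud and the reconstructed NOCS coordinates, is typically solved in closed form with the Umeyama algorithm. The sketch below is a minimal NumPy implementation of that standard least-squares alignment, not the authors' code; the function name and interface are illustrative.

```python
import numpy as np

def umeyama(src, dst):
    """Least-squares similarity transform (s, R, t) mapping src -> dst.

    src, dst: (N, 3) arrays of corresponding points (e.g., reconstructed
    NOCS coordinates and the observed, back-projected point cloud).
    Returns scale s, rotation R (3x3), and translation t (3,) such that
    dst ~= s * (src @ R.T) + t.
    """
    n = src.shape[0]
    mu_src = src.mean(axis=0)
    mu_dst = dst.mean(axis=0)
    src_c = src - mu_src
    dst_c = dst - mu_dst

    # Cross-covariance between the centered point sets.
    cov = dst_c.T @ src_c / n
    U, D, Vt = np.linalg.svd(cov)

    # Reflection correction keeps R a proper rotation (det(R) = +1).
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0

    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / n
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t
```

Because the transform is closed-form, the pose estimate is deterministic given the dense correspondences; the accuracy of the method therefore hinges on the quality of the learned correspondence matrix and deformation field.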
