Abstract

Convolutional Occupancy Networks (ConvONet) have gained popularity in object-level and scene-level reconstruction. However, how to better represent 3D features for ConvONet remains an open question. In this paper, we propose to improve the representation of ConvONet by enhancing both 3D positional information and 3D-2D correlations. Since positional information is a fundamental component of a 3D shape, we propose a Position-Aware Transformer (PAT) architecture that incorporates an Adaptive Multi-Scale Position Encoding (AMSPE) into the self-attention computation. By leveraging both global and local position aggregations in a multi-level manner, AMSPE enables better representations of both the coarse and fine structures of a 3D shape. Meanwhile, since projecting 3D features onto 2D planes for convolution inevitably introduces ambiguous or noisy representations, we propose a 3D Correlation-Guided Enhancement (CGE) network to bridge the gap between 3D and 2D shape representations. Specifically, we leverage the projected 3D correlations from PAT as structural guidance and compute 3D Correlation-Guided Attentions (CGAs) to enhance the most representative features in the 2D space. In this way, the proposed architecture preserves the most informative structural representations while alleviating the impact of mis-projected and noisy features. Experiments on ShapeNet and an indoor scene dataset demonstrate the superiority of our method: both quantitative and qualitative results show that it achieves state-of-the-art performance for implicit 3D reconstruction.
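
The abstract does not include implementation details, but the first core idea can be illustrated. Below is a minimal, hypothetical PyTorch sketch of injecting a multi-scale position encoding into self-attention; the scale set, the per-scale MLP encoders, the pooling-based local aggregation, and the softmax-weighted adaptive fusion are all illustrative assumptions, not the paper's actual AMSPE.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AMSPESelfAttention(nn.Module):
    """Illustrative self-attention with an adaptive multi-scale positional bias.

    Hypothetical sketch: point coordinates are aggregated at several scales
    (local windows growing toward global context), each scale is encoded by
    a small MLP, and a learned softmax weighting fuses the encodings before
    they are added to the attention input.
    """

    def __init__(self, dim: int, num_heads: int = 4, scales=(1, 4, 16)):
        super().__init__()
        self.scales = scales
        self.pos_mlps = nn.ModuleList(
            nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))
            for _ in scales
        )
        # Adaptive weights over scales (softmax-normalized at forward time).
        self.scale_logits = nn.Parameter(torch.zeros(len(scales)))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feats: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, dim) point features; coords: (B, N, 3) positions.
        n = coords.shape[1]
        encodings = []
        for mlp, s in zip(self.pos_mlps, self.scales):
            if s > 1:
                # Local aggregation: average coordinates over a window of
                # size s along the point axis (a simplification for the sketch).
                pooled = F.avg_pool1d(
                    coords.transpose(1, 2), kernel_size=s, stride=1,
                    padding=s // 2, count_include_pad=False,
                ).transpose(1, 2)[:, :n]
            else:
                pooled = coords
            encodings.append(mlp(pooled))
        w = torch.softmax(self.scale_logits, dim=0)
        pos = sum(wi * e for wi, e in zip(w, encodings))  # (B, N, dim)
        x = feats + pos
        out, _ = self.attn(x, x, x)
        return out


# Example: 1,024 points with 64-dim features.
layer = AMSPESelfAttention(dim=64)
out = layer(torch.randn(2, 1024, 64), torch.rand(2, 1024, 3))  # (2, 1024, 64)
```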
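
Similarly, the second idea, using projected 3D correlations to guide attention over 2D plane features, might look like the following sketch. The sigmoid gate over a 1x1 convolution of the projected correlations and the residual 3x3 refinement are assumed forms chosen for illustration, not the paper's CGA design.

```python
import torch
import torch.nn as nn


class CorrelationGuidedAttention(nn.Module):
    """Hypothetical sketch of a 3D Correlation-Guided Attention (CGA).

    A projected 3D correlation map modulates 2D plane features so that
    structure-consistent projections are emphasized while ambiguous or
    noisy ones are suppressed.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.to_gate = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=1), nn.Sigmoid()
        )
        self.refine = nn.Conv2d(dim, dim, kernel_size=3, padding=1)

    def forward(self, plane_feats: torch.Tensor, corr_feats: torch.Tensor) -> torch.Tensor:
        # plane_feats: (B, C, H, W) 3D features projected onto a 2D plane.
        # corr_feats:  (B, C, H, W) 3D correlations projected onto the same plane.
        gate = self.to_gate(corr_feats)       # per-pixel attention weights in [0, 1]
        enhanced = plane_feats * gate         # emphasize well-supported features
        return plane_feats + self.refine(enhanced)  # residual refinement


cga = CorrelationGuidedAttention(dim=32)
out = cga(torch.randn(1, 32, 64, 64), torch.randn(1, 32, 64, 64))  # (1, 32, 64, 64)
```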
