Abstract
Transformer-based architectures have recently shown impressive performance on various point cloud understanding tasks such as 3D object shape classification and semantic segmentation. In particular, this can be attributed to their self-attention mechanism, which captures long-range dependencies. However, current methods constrain self-attention to operate within local patches due to its quadratic memory cost. This hinders generalization and scaling capacity, since non-locality is lost in the early layers. To tackle this issue, we propose a window-based transformer architecture that captures long-range dependencies while aggregating information in local patches. We do this by letting each window interact with a set of global point cloud tokens — a representative subset of the entire scene — and by augmenting the local geometry through a 3D Histogram of Oriented Gradients (HOG) descriptor. Through a series of experiments on segmentation and classification tasks, we show that our model exceeds the state of the art on S3DIS semantic segmentation (+1.67% mIoU) and ShapeNetPart part segmentation (+1.03% instance mIoU), and performs competitively on ScanObjectNN 3D object classification. The code and trained models will be made publicly available.
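The core mechanism described above — local window tokens cross-attending to a small set of global scene tokens to recover long-range context — can be illustrated with a minimal NumPy sketch. This is an assumption-laden illustration, not the authors' implementation: the function name `window_global_attention`, the single-head formulation, and the residual update are all hypothetical simplifications.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def window_global_attention(window_tokens, global_tokens):
    """Single-head cross-attention sketch (hypothetical, simplified):
    each token in a local window attends to a small set of global
    tokens, injecting scene-level context into the window.

    window_tokens: (n, d) features of the points in one local window.
    global_tokens: (g, d) features of a representative subset of the scene.
    Returns: (n, d) window features updated with global context.
    """
    d = window_tokens.shape[-1]
    # Scaled dot-product scores between window queries and global keys.
    scores = window_tokens @ global_tokens.T / np.sqrt(d)   # (n, g)
    attn = softmax(scores, axis=-1)                         # rows sum to 1
    # Residual update: add the attention-weighted global values.
    return window_tokens + attn @ global_tokens

# Toy usage: 16 points in a window, 4 global tokens, 32-dim features.
rng = np.random.default_rng(0)
w = rng.standard_normal((16, 32))
g = rng.standard_normal((4, 32))
out = window_global_attention(w, g)
print(out.shape)  # (16, 32)
```

Because the attention cost is O(n·g) rather than O(n²) over the whole cloud, keeping g small preserves the memory benefit of windowed attention while restoring non-local information flow.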