Abstract

The recently developed pure transformer architectures have attained promising accuracy on point cloud learning benchmarks compared to convolutional neural networks. However, existing point cloud transformers are computationally expensive because they spend a significant amount of time structuring irregular data. To address this shortcoming, we present the Sparse Window Attention module, which gathers coarse-grained local features from nonempty voxels. The module not only bypasses expensive irregular data structuring and invalid computation on empty voxels, but also achieves linear computational complexity with respect to voxel resolution. Meanwhile, we leverage two different self-attention variants to gather fine-grained features about the global shape, chosen according to the scale of the point cloud. Finally, we construct our neural architecture, the point-voxel transformer (PVT), which integrates these modules into a joint framework for point cloud learning. Compared with previous transformer-based and attention-based models, our method attains a top accuracy of 94.1% on the classification benchmark and an average $10\times$ inference speedup. Extensive experiments also validate the effectiveness of PVT on semantic segmentation benchmarks. Our code and pretrained models are available at https://github.com/HaochengWan/PVT.
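To illustrate the core idea behind the Sparse Window Attention module, the sketch below is a minimal PyTorch rendering, not the authors' implementation (see the repository above for that). The function name `sparse_window_attention` and the arguments `feats`, `coords`, and `window_size` are hypothetical. It assigns only the nonempty voxels to windows and runs self-attention within each window, so the cost scales with the number of nonempty voxels rather than the full voxel grid.

```python
import torch
import torch.nn.functional as F

def sparse_window_attention(feats, coords, window_size, num_heads=4):
    """Self-attention restricted to nonempty voxels within each window.

    feats:  (N, C) features of the N nonempty voxels only.
    coords: (N, 3) integer voxel coordinates of those voxels.
    Empty voxels never appear in the input, so the computation is
    linear in N rather than in the full voxel resolution.
    """
    N, C = feats.shape
    head_dim = C // num_heads
    # Assign each nonempty voxel to a window by integer division.
    win = torch.div(coords, window_size, rounding_mode="floor")
    # Group voxels that share a window: `group[i]` is the window id of voxel i.
    _, group = torch.unique(win, dim=0, return_inverse=True)

    out = torch.empty_like(feats)
    for g in range(int(group.max()) + 1):
        idx = (group == g).nonzero(as_tuple=True)[0]
        x = feats[idx]  # (n_g, C) features of the voxels in this window
        # Multi-head scaled dot-product attention within the window
        # (learned Q/K/V projections are omitted for brevity).
        h = x.reshape(len(idx), num_heads, head_dim).transpose(0, 1)
        attn = (h @ h.transpose(-2, -1)) / head_dim ** 0.5
        h = F.softmax(attn, dim=-1) @ h
        out[idx] = h.transpose(0, 1).reshape(len(idx), C)
    return out

# Example: 128 nonempty voxels with 64-dim features in a 32^3 grid.
feats = torch.randn(128, 64)
coords = torch.randint(0, 32, (128, 3))
out = sparse_window_attention(feats, coords, window_size=8)
```

The per-window Python loop is kept for clarity; a practical implementation would batch the windows (e.g., by padding or scatter operations) and add learned projections, but the grouping logic that skips empty voxels is the same.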
