Abstract

The success of transformer networks in the natural language processing and 2D vision domains has encouraged the adaptation of transformers to 3D computer vision tasks. However, most existing approaches employ standard backpropagation (SBP). SBP requires storing model activations from the forward pass for use during the backward pass, so memory consumption grows linearly with model depth, which is inefficient. Furthermore, most 3D point transformers use the classic QKV matrix multiplication design, whose memory cost grows quadratically with input sequence length and thus creates a memory bottleneck. To address these issues, we propose a memory-efficient point transformer that uses reversible functions and linearized self-attention to reduce the memory complexities of backpropagation and self-attention, respectively. Additionally, rather than the usual UNet architectural design for segmentation, we adopt a ∇-shaped design that captures multi-size/resolution feature representations for finely detailed segmentation. Experimental results on benchmark datasets (Toronto3D, DALES, and CSPC) from different sensor platforms (vehicle, aerial, and backpack) show that our approach uses less than half the model parameters of its SBP counterpart. It also processes input sequences more than twice as long while using less than half the memory of most traditional approaches. Additionally, the ∇-shaped architectural design yields roughly a 5% improvement in model performance over the U-shaped approach. Overall, the proposed PReFormer attains performance competitive with the state-of-the-art.
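To make the two memory-saving ideas concrete, the sketch below illustrates (i) linearized self-attention, which avoids materializing the N×N attention matrix, and (ii) a reversible residual block, whose inputs can be recomputed from its outputs so per-layer activations need not be cached. This is a minimal illustration assuming PyTorch; the class and function names (`linear_attention`, `ReversibleBlock`, `f`, `g`) are hypothetical and do not reflect the actual PReFormer implementation.

```python
# Illustrative sketch only (PyTorch assumed); names are hypothetical and
# not taken from the PReFormer code base.
import torch
import torch.nn as nn
import torch.nn.functional as F


def linear_attention(q, k, v):
    """Kernelized (linearized) self-attention with an elu(x)+1 feature map.

    Avoids forming the N x N attention matrix, so memory grows linearly
    with the number of input points N rather than quadratically.
    q, k: (batch, N, d); v: (batch, N, e).
    """
    q = F.elu(q) + 1.0
    k = F.elu(k) + 1.0
    kv = torch.einsum("bnd,bne->bde", k, v)                    # (batch, d, e)
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + 1e-6)
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)           # (batch, N, e)


class ReversibleBlock(nn.Module):
    """Reversible residual pair: outputs determine inputs exactly, so
    activations can be reconstructed during the backward pass instead of
    being stored for every layer."""

    def __init__(self, f: nn.Module, g: nn.Module):
        super().__init__()
        self.f, self.g = f, g

    def forward(self, x1, x2):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def inverse(self, y1, y2):
        # Recompute the inputs from the outputs rather than caching them.
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2
```

In practice, exploiting reversibility for memory savings also requires a custom autograd routine that recomputes activations layer by layer during the backward pass; that machinery is omitted from this sketch.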
