Abstract

Effective and efficient 3D semantic segmentation of large-scale LiDAR point clouds is a fundamental problem in autonomous driving. In this paper, we present the Transformer-Range-View Network (TransRVNet), a novel and powerful projection-based CNN-Transformer architecture for inferring point-wise semantics. First, a Multi Residual Channel Interaction Attention Module (MRCIAM) is introduced to capture channel-level multi-scale features and to model intra-channel and inter-channel correlations via an attention mechanism. Then, in the encoder stage, a well-designed Residual Context Aggregation Module (RCAM), comprising a residual dilated convolution structure and a context aggregation module, fuses information from different receptive fields while reducing the impact of missing points. Finally, a Balanced Non-square-Transformer Module (BNTM) serves as the fundamental component of the decoder: its non-square shifted-window strategy captures local feature dependencies for more discriminative feature learning. Extensive qualitative and quantitative experiments on the challenging SemanticKITTI and SemanticPOSS benchmarks verify the effectiveness of the proposed technique, and TransRVNet outperforms most existing state-of-the-art approaches. The source code and trained model are available at https://github.com/huixiancheng/TransRVNet.
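To make the non-square shifted-window idea concrete, below is a minimal PyTorch sketch of self-attention restricted to rectangular windows of a range-image feature map. This is an illustration under stated assumptions, not the authors' implementation: the class name NonSquareWindowAttention, the (2, 8) window, the (1, 4) shift, and the use of nn.MultiheadAttention are all hypothetical choices. The intuition is that LiDAR range images are strongly anisotropic (e.g., 64 x 2048 on SemanticKITTI), so a short, wide window covers a more natural local neighborhood than a square one.

```python
import torch
import torch.nn as nn


class NonSquareWindowAttention(nn.Module):
    """Multi-head self-attention inside non-overlapping rectangular windows.

    A minimal sketch, assuming a (B, C, H, W) range-image feature map and a
    hypothetical (2, 8) window; the paper's BNTM details may differ.
    """

    def __init__(self, dim, window=(2, 8), heads=4):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                        # x: (B, C, H, W)
        B, C, H, W = x.shape
        wh, ww = self.window
        assert H % wh == 0 and W % ww == 0, "H, W must be divisible by window"
        # Partition the map into (wh, ww) windows: one window per batch row.
        x = x.view(B, C, H // wh, wh, W // ww, ww)
        x = x.permute(0, 2, 4, 3, 5, 1)          # (B, nH, nW, wh, ww, C)
        x = x.reshape(-1, wh * ww, C)            # (B*nH*nW, wh*ww, C)
        x, _ = self.attn(x, x, x)                # attention within each window
        # Reverse the partition back to (B, C, H, W).
        x = x.view(B, H // wh, W // ww, wh, ww, C)
        x = x.permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)
        return x


def shifted(block, x, shift=(1, 4)):
    """Emulate the 'shifted' half of the strategy by rolling the feature map
    before attention and rolling it back afterwards (hypothetical shift)."""
    x = torch.roll(x, shifts=(-shift[0], -shift[1]), dims=(2, 3))
    x = block(x)
    return torch.roll(x, shifts=(shift[0], shift[1]), dims=(2, 3))
```

As a usage example, alternating `block(x)` and `shifted(block, x)` lets information flow across window boundaries, which is the standard motivation for shifted-window attention.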
