Abstract

We propose a novel framework to learn 3D point cloud semantics from 2D multi-view image observations that contain pose errors. Standard automated-driving datasets typically capture both LiDAR point clouds and RGB images. This motivates a “task transfer” paradigm in which 3D semantic segmentation benefits from aggregating 2D semantic cues. However, pose noise in the 2D image observations and erroneous predictions from 2D semantic segmentation make this “task transfer” difficult. To account for these two factors, we observe each 3D point in multiple views and use a patch observation in every image. Moreover, the semantic labels of a block of neighboring 3D points are predicted simultaneously, which lets us exploit the point structure prior. A hierarchical full attention network (HiFANet) is designed to sequentially aggregate patch, bag-of-frames and inter-point semantic cues, with the attention mechanism tailored to each level of semantic cue. Each attention block substantially reduces the feature size before passing it to the next block, which keeps the framework slim. Experimental results on Semantic-KITTI show that the proposed framework significantly outperforms existing 3D point cloud based methods while requiring less training data and tolerating pose noise. The code is available at https://github.com/yuhanghe01/HiFANet.

Keywords: Attention network, Semantic segmentation, Pose noise
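The three-level aggregation described in the abstract (patch, then bag-of-frames, then inter-point) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the fixed query vectors and the toy sizes are all hypothetical, and plain dot-product attention stands in for whatever learned attention blocks HiFANet actually uses. It only shows how each stage collapses one axis of the input before the next stage runs.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(features, query):
    """Collapse a list of feature vectors into one vector.

    Uses scaled dot-product attention against a single query vector.
    In a trained network the query would be learned; here it is a
    fixed, hypothetical vector.
    """
    scale = math.sqrt(len(query))
    scores = [sum(f * q for f, q in zip(feat, query)) / scale for feat in features]
    weights = softmax(scores)
    dim = len(features[0])
    return [sum(w * feat[d] for w, feat in zip(weights, features)) for d in range(dim)]


def self_attention(features):
    """Let every vector attend to all vectors; returns one vector per input."""
    return [attention_pool(features, query) for query in features]


def hierarchical_aggregate(block, patch_query, view_query):
    """Aggregate block[point][view][pixel] -> one refined vector per point."""
    point_features = []
    for views in block:
        # Stage 1 (patch): pool the pixels of each image patch into one vector.
        view_features = [attention_pool(patch, patch_query) for patch in views]
        # Stage 2 (bag of frames): pool the per-view vectors into one per point.
        point_features.append(attention_pool(view_features, view_query))
    # Stage 3 (inter-point): neighboring points in the block refine each other.
    return self_attention(point_features)


# Toy sizes, chosen only for illustration.
NUM_POINTS, NUM_VIEWS, PATCH_PIXELS, DIM = 4, 3, 9, 8
rng = random.Random(0)


def rand_vec():
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]


block = [[[rand_vec() for _ in range(PATCH_PIXELS)]
          for _ in range(NUM_VIEWS)]
         for _ in range(NUM_POINTS)]

refined = hierarchical_aggregate(block, rand_vec(), rand_vec())
print(len(refined), len(refined[0]))
```

In this sketch the first two stages each reduce a whole axis (pixels, then views) to a single vector, so the inter-point stage only sees one vector per 3D point. This mirrors the abstract's remark that each attention block shrinks the feature size before the next one.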

