Abstract

Three-dimensional (3D) occupancy perception aims to enable autonomous vehicles to observe and understand dense 3D environments. Estimating the complete geometry and semantics of a scene from visual images alone is challenging. Humans, however, can readily infer the complete form of an object from partial key cues and prior experience, an ability that is crucial for recognizing and interpreting the surrounding environment. To equip 3D occupancy perception systems with a similar capability, we propose a 3D semantic scene completion method called AEFF-SSC, which deeply exploits boundary and multi-scale information in voxels to reconstruct 3D geometry more accurately. We design an attention-enhanced feature fusion module that fuses image features from different scales and focuses on boundary information, thereby extracting voxel features more effectively. In addition, we introduce a semantic segmentation module driven by a 3D attention-UNet, which combines a 3D U-Net with a 3D attention mechanism; through feature fusion and feature weighting, it helps restore 3D spatial information and markedly improves segmentation accuracy. Experiments on the SemanticKITTI dataset show that AEFF-SSC significantly outperforms existing methods in both geometry and semantics: within the 12.8 m × 12.8 m area directly ahead, geometric occupancy accuracy shows a significant improvement of 71.58%, and semantic segmentation accuracy likewise improves by 54.20%.
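As context for the "3D attention-UNet" segmentation module mentioned above, the following is a minimal sketch of a 3D attention gate of the kind commonly used to weight U-Net skip connections. It is an illustrative assumption, not the authors' implementation; class names, channel sizes, and tensor shapes are hypothetical.

```python
# Minimal sketch (assumption, not AEFF-SSC's code): a 3D attention gate that
# re-weights encoder skip features using a gating signal from the decoder,
# as in attention U-Net variants. All names and sizes are illustrative.
import torch
import torch.nn as nn

class AttentionGate3D(nn.Module):
    """Weights 3D skip-connection features by an attention map derived from a gating signal."""
    def __init__(self, skip_ch: int, gate_ch: int, inter_ch: int):
        super().__init__()
        self.theta = nn.Conv3d(skip_ch, inter_ch, kernel_size=1)  # project skip features
        self.phi = nn.Conv3d(gate_ch, inter_ch, kernel_size=1)    # project gating signal
        self.psi = nn.Conv3d(inter_ch, 1, kernel_size=1)          # collapse to a scalar attention map
        self.act = nn.ReLU(inplace=True)
        self.sigmoid = nn.Sigmoid()

    def forward(self, skip: torch.Tensor, gate: torch.Tensor) -> torch.Tensor:
        # skip and gate are assumed to share the same spatial resolution (D, H, W) here.
        attn = self.sigmoid(self.psi(self.act(self.theta(skip) + self.phi(gate))))
        return skip * attn  # attention-weighted skip features passed on to the decoder

# Usage sketch: 32-channel encoder skip and 64-channel decoder gate on a 64^3 voxel grid.
skip = torch.randn(1, 32, 64, 64, 64)
gate = torch.randn(1, 64, 64, 64, 64)
out = AttentionGate3D(32, 64, 16)(skip, gate)  # -> shape (1, 32, 64, 64, 64)
```

In such designs, the attention map suppresses irrelevant voxels in the skip path before fusion with upsampled decoder features, which is one plausible reading of the "feature weighting" described in the abstract.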