Environmental perception is crucial for safe autonomous driving, enabling accurate analysis of the vehicle’s surroundings. While 3D LiDAR is traditionally used for 3D environment reconstruction, its high cost and complexity present challenges. In contrast, camera-based cross-view frameworks can offer a cost-effective alternative. Hence, this manuscript proposes a new cross-view model to extract mapping features from camera images and then transfer them to a Bird’s-Eye View (BEV) map. Particularly, a multi-head attention mechanism in the decoder architecture generates the final semantic map. Each camera learns embedding information corresponding to its position and angle within the BEV map. Cross-view attention fuses information from different perspectives to predict top-down map features enriched with spatial information. The multi-head attention mechanism then globally performs dependency matches, enhancing long-range information and capturing latent relationships between features. Transposed convolution replaces traditional upsampling methods, avoiding high similarities of local features and facilitating semantic segmentation inference of the BEV map. Finally, we conduct numerous simulation experiments to verify the performance of our cross-view model.
Read full abstract