Abstract

Multi-view 3D visual perception, including 3D object detection and bird's-eye-view (BEV) map segmentation, is essential for autonomous driving. However, 3D context attention between dynamic objects and static map elements under multi-view camera inputs has received little attention, because recovering 3D spatial information from images and performing effective 3D context interaction are both challenging. Such 3D context information is expected to provide additional cues that enhance 3D visual perception for autonomous driving. We therefore propose a transformer-based framework, CI3D, that implicitly models 3D context interaction between dynamic objects and static map elements. To this end, dynamic object queries and static map queries, sparsely represented in 3D space, gather information from multi-view image features. A dynamic 3D position encoder generates precise positional embeddings for these queries. With accurate positional embeddings, the queries aggregate 3D context information through a multi-head attention mechanism, modeling the 3D context interaction. We further show that the sparse supervision signal from the limited number of queries leads to coarse and ambiguous image features. To overcome this, we introduce a panoptic segmentation head as an auxiliary task and a 3D-to-2D deformable cross-attention module, which together substantially improve the robustness of spatial feature learning and sampling. Extensive evaluation on two large-scale datasets, nuScenes and Waymo, shows that our approach significantly outperforms the baseline method on both benchmarks.
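
To make the context-interaction idea concrete, the following is a minimal PyTorch sketch of joint attention between dynamic object queries and static map queries with 3D positional embeddings, as the abstract describes. It is not the paper's released implementation; the module name `ContextInteraction`, the MLP standing in for the dynamic 3D position encoder, and all tensor shapes are assumptions made for illustration only.

```python
# Hypothetical sketch of 3D context interaction between object and map queries.
# All names and dimensions are illustrative assumptions, not CI3D's actual code.
import torch
import torch.nn as nn


class ContextInteraction(nn.Module):
    """Joint multi-head attention over dynamic object queries and static map queries."""

    def __init__(self, embed_dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Small MLP standing in for the "dynamic 3D position encoder" (assumption).
        self.pos_encoder = nn.Sequential(
            nn.Linear(3, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim)
        )
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, obj_queries, obj_xyz, map_queries, map_xyz):
        # Concatenate object and map queries so attention can model their
        # 3D context interaction jointly.
        queries = torch.cat([obj_queries, map_queries], dim=1)
        pos = self.pos_encoder(torch.cat([obj_xyz, map_xyz], dim=1))
        q = k = queries + pos  # positional embeddings guide the attention
        out, _ = self.attn(q, k, queries)
        out = self.norm(queries + out)
        n_obj = obj_queries.shape[1]
        return out[:, :n_obj], out[:, n_obj:]


# Toy usage: batch of 2, 900 object queries, 400 map queries, 256-d embeddings.
if __name__ == "__main__":
    B, N_obj, N_map, C = 2, 900, 400, 256
    layer = ContextInteraction(C)
    obj_q, map_q = torch.randn(B, N_obj, C), torch.randn(B, N_map, C)
    obj_xyz, map_xyz = torch.randn(B, N_obj, 3), torch.randn(B, N_map, 3)
    obj_out, map_out = layer(obj_q, obj_xyz, map_q, map_xyz)
    print(obj_out.shape, map_out.shape)  # (2, 900, 256) and (2, 400, 256)
```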
