Multimodal 3D object detection methods are poorly adapted to real-world traffic scenes because point clouds are sparsely distributed and multimodal data become misaligned during actual collection. Existing methods focus on high-quality open-source datasets, and their performance relies on accurate structural representation of point clouds and a precise mapping between point clouds and images. To address these challenges, this paper proposes a multimodal feature-level fusion method based on bi-directional interaction between image and point cloud. To overcome the sparsity of asynchronous multimodal data, a point cloud densification scheme guided by both visual cues and point cloud density is proposed; it can generate object-level virtual point clouds even when the point cloud and image are misaligned. To handle the misalignment between point cloud and image, a bi-directional interaction module is proposed in which the image guides interaction with key points of the point cloud and the point cloud guides interaction with image context information, achieving effective feature fusion even when the two modalities are misaligned. Experiments on the VANJEE and KITTI datasets demonstrate the effectiveness of the proposed method, with average precision improvements of 6.20% and 1.54% over the baseline.