Abstract

Robust 3D detection based on camera-LiDAR fusion has become a research focus for autonomous driving. Despite extensive efforts, it remains extremely challenging to establish fine-grained correspondences between different modalities and to achieve complementary, interference-free multimodal fusion. To these ends, we propose a multimodal 3D detector termed CFFNet, consisting mainly of a multimodal feature extraction backbone and a corresponding feature-based attention fusion (CFAF) module. Specifically, the multimodal feature extraction backbone builds a simple yet effective cross-modal correspondence by unifying multimodal features into a shared bird's eye view (BEV) space. Expressive, noise-suppressed image BEV features are generated from pseudo image voxels, whose depth is estimated with range-wise attention drawn from LiDAR BEV features. Taking the pronounced heterogeneity of multimodal features into account, the CFAF module performs discriminative fusion targeted at the informative and contributive features retained in each modality. The fusion is accomplished by element-wise reweighting of image BEV features with attention generated from differential LiDAR features. The multimodal feature extraction backbone and the CFAF module are each evaluated and shown to offer significant performance gains, both individually and in combination. Evaluations on the KITTI 3D object detection dataset show that the proposed CFFNet achieves state-of-the-art performance.
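
As a rough illustration of the CFAF reweighting idea described above, the following is a minimal PyTorch-style sketch. The class name CFAFSketch, the small convolutional attention head, and the interpretation of the "differential features" as the element-wise difference between the LiDAR and image BEV maps are all assumptions made for illustration; the paper's actual module may be designed differently.

```python
import torch
import torch.nn as nn


class CFAFSketch(nn.Module):
    """Hypothetical sketch of the CFAF idea from the abstract:
    image BEV features are element-wise reweighted by attention
    derived from LiDAR differential features, then fused with the
    LiDAR BEV features. The differential features are assumed here
    to be the difference between the two BEV maps.
    """

    def __init__(self, channels: int):
        super().__init__()
        # Small conv head mapping differential features to per-element
        # attention weights in (0, 1).
        self.attn = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, lidar_bev: torch.Tensor, image_bev: torch.Tensor) -> torch.Tensor:
        # Assumed differential features: what the LiDAR branch encodes
        # beyond the image branch, computed element-wise in BEV space.
        diff = lidar_bev - image_bev
        weights = self.attn(diff)
        # Element-wise reweighting of the image BEV features, followed
        # by an additive fusion with the LiDAR BEV features.
        return lidar_bev + weights * image_bev


# Usage on dummy BEV maps (batch 2, 128 channels, 200 x 176 BEV grid):
cfaf = CFAFSketch(channels=128)
fused = cfaf(torch.randn(2, 128, 200, 176), torch.randn(2, 128, 200, 176))
print(fused.shape)  # torch.Size([2, 128, 200, 176])
```

The sigmoid-gated reweighting lets the module suppress image BEV cells that contribute little beyond the LiDAR evidence while passing through complementary ones, which matches the discriminative, interference-free fusion goal stated in the abstract.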
