Abstract

Camera and millimeter-wave (MMW) radar fusion is essential for accurate and robust autonomous driving systems. With the advancement of radar technology, next-generation high-resolution automotive radar, i.e., 4D radar, has emerged. In addition to the target range, azimuth, and Doppler velocity measured by traditional radar, 4D radar provides elevation measurements, yielding a denser point cloud. In this study, we propose a camera and 4D radar fusion network called RCFusion, which fuses multimodal features in a unified bird's-eye view (BEV) space for 3D object detection. In the camera stream, multi-scale feature maps are extracted by the image backbone and a feature pyramid network; they are then converted into orthographic feature maps by an orthographic feature transform. Next, enhanced, fine-grained image BEV features are obtained via a specially designed shared attention encoder. Meanwhile, in the 4D radar stream, a newly designed component named Radar PillarNet efficiently encodes the radar features into radar pseudo-images, which are fed into the point cloud backbone to produce radar BEV features. An interactive attention module is proposed for the fusion stage, effectively fusing the BEV features of the two modalities. Finally, a generic detection head predicts object classes and locations. The proposed RCFusion is validated on the TJ4DRadSet and View-of-Delft datasets. The experimental results and analysis show that the proposed method can effectively fuse camera and 4D radar features to achieve robust detection performance.
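Since the abstract names the interactive attention module but does not detail it, the following is a minimal PyTorch sketch of one plausible form of cross-modal BEV fusion: each stream's channel attention gates the other stream before a 1×1 convolution merges them. All class names, channel sizes, and the specific gating scheme are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class InteractiveAttentionFusion(nn.Module):
    """Hypothetical sketch of an interactive attention fusion step:
    each modality's BEV map gates the other via channel attention,
    then the two gated streams are concatenated and projected back
    to a single fused BEV map."""

    def __init__(self, channels: int):
        super().__init__()
        # Channel-attention branch per modality: global pool -> 1x1 conv -> sigmoid gate.
        self.img_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        self.radar_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Merge the two cross-gated streams into one BEV feature map.
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, img_bev: torch.Tensor, radar_bev: torch.Tensor) -> torch.Tensor:
        # Cross-modal gating: radar attention re-weights image channels and
        # vice versa, so each stream emphasizes channels the other modality
        # deems informative.
        img_attended = img_bev * self.radar_gate(radar_bev)
        radar_attended = radar_bev * self.img_gate(img_bev)
        return self.fuse(torch.cat([img_attended, radar_attended], dim=1))


if __name__ == "__main__":
    # Both streams are assumed to produce BEV maps on the same grid,
    # e.g. 256 channels over a 128 x 128 BEV plane.
    img_bev = torch.randn(1, 256, 128, 128)
    radar_bev = torch.randn(1, 256, 128, 128)
    fused = InteractiveAttentionFusion(256)(img_bev, radar_bev)
    print(fused.shape)  # torch.Size([1, 256, 128, 128])
```

The point of cross-gating over plain concatenation is that each modality can suppress the other's uninformative channels before merging, which is the general behavior an "interactive" attention fusion aims at; the exact mechanism in RCFusion may differ.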
