Monocular 3D object detection is essential for identifying objects in road images, providing the environmental perception data that are crucial for human-centric autonomous driving systems. However, due to the inherent limitations of camera imaging, obtaining precise depth information from images alone is challenging, which hampers the accuracy of object localization within a scene. In this paper, we introduce a monocular 3D object detection method called MonoFG that uses knowledge distillation with separated foreground and background components to improve localization accuracy. First, separating the foreground and background distillation processes allows the distinct positional information carried by each region to be exploited, optimizing the overall distillation effect. This separation serves as the foundation for the subsequent feature and response distillation processes, which operate on the separated foreground and background rather than on isolated objects. Second, feature distillation based on a triple attention mechanism strengthens the feature imitation and representation capabilities of the student network: spatial and channel attention encourage the student to capture crucial pixels and channels from the teacher, whereas self-attention globally transfers the learned relationships between pixels. Third, response distillation based on localization error enables a more precise transfer of positional information from the teacher network to the student network. Knowledge can be comprehensively distilled across both the foreground and background only where the localization ability of the teacher exceeds that of the student; the distillation process is therefore constrained to specific content, with localization errors serving as its boundaries. Finally, experiments on the KITTI benchmark demonstrate that our method outperforms many well-known baselines on several representative evaluation tasks relevant to human-centric autonomous driving, including 3D object detection and bird's-eye view (BEV) detection.
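To make the two loss designs sketched in the abstract more concrete, the PyTorch-style snippet below gives a minimal, hypothetical rendering of (a) foreground/background-separated feature distillation guided by channel and spatial attention, and (b) response distillation gated by localization error. The function names, the particular attention formulations, the weights alpha_fg, alpha_bg, and beta, and the gating rule are our illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def fg_bg_feature_distillation(feat_s, feat_t, fg_mask, alpha_fg=1.0, alpha_bg=0.5):
    """Illustrative sketch: distill teacher features into student features,
    weighting foreground and background regions separately.

    feat_s, feat_t: (B, C, H, W) student / teacher feature maps
    fg_mask:        (B, 1, H, W) binary mask, 1 inside ground-truth boxes
    """
    # Channel attention from the teacher: globally pooled responses tell
    # the student which channels carry the most signal (an assumption).
    chan_att = torch.softmax(feat_t.mean(dim=(2, 3)), dim=1)  # (B, C)
    chan_att = chan_att.unsqueeze(-1).unsqueeze(-1)           # (B, C, 1, 1)

    # Spatial attention: channel-averaged teacher activations highlight
    # the pixels the student should imitate most closely.
    spat_att = torch.softmax(
        feat_t.mean(dim=1, keepdim=True).flatten(2), dim=2
    ).view_as(fg_mask)                                        # (B, 1, H, W)

    # Per-pixel imitation error, modulated by both attention maps.
    err = (feat_s - feat_t).pow(2) * chan_att * spat_att

    # Normalize foreground and background separately so the much larger
    # background region does not dominate the loss.
    bg_mask = 1.0 - fg_mask
    loss_fg = (err * fg_mask).sum() / fg_mask.sum().clamp(min=1.0)
    loss_bg = (err * bg_mask).sum() / bg_mask.sum().clamp(min=1.0)
    return alpha_fg * loss_fg + alpha_bg * loss_bg


def error_gated_response_distillation(pred_s, pred_t, target, beta=1.0):
    """Illustrative sketch: the student imitates the teacher's box
    regression only at positions where the teacher is closer to the
    ground truth than the student, matching the abstract's condition
    that the teacher's localization ability must exceed the student's.

    pred_s, pred_t, target: (N, D) box regression outputs / targets
    """
    err_s = (pred_s - target).abs().sum(dim=1)  # student localization error
    err_t = (pred_t - target).abs().sum(dim=1)  # teacher localization error
    gate = (err_t < err_s).float()              # 1 where the teacher is better

    per_pos = F.smooth_l1_loss(pred_s, pred_t.detach(), reduction="none").sum(dim=1)
    return beta * (gate * per_pos).sum() / gate.sum().clamp(min=1.0)
```

In this reading, the gate realizes the abstract's "positioning errors as boundaries": distillation signal flows only where imitating the teacher can actually reduce the student's localization error, while the separate foreground/background normalization reflects the separated distillation design.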