Intelligent perception is crucial to Intelligent Transportation Systems (ITS), with vision cameras serving as critical components. However, traditional RGB cameras suffer a significant decline in performance when capturing nighttime traffic scenes, limiting their effectiveness in supporting ITS. In contrast, event cameras offer a high dynamic range (140 dB vs. 60 dB for traditional cameras), enabling them to overcome frame degradation in low-light conditions. Recently, multimodal learning paradigms have made substantial progress in various vision tasks, such as image-text retrieval. Motivated by this progress, we propose an adaptive selection and fusion detection method that leverages both the event and RGB frame domains to jointly optimize nighttime traffic object detection. To address the challenge of unbalanced multimodal data fusion, we design a learnable adaptive selection and fusion module that ranks and fuses features along the channel dimension, enabling efficient multimodal fusion. Additionally, we construct a novel multi-level feature pyramid network based on multimodal attention fusion, which extracts latent features to improve robustness in detecting nighttime traffic objects. Furthermore, we curate a nighttime traffic dataset comprising RGB frames and corresponding event streams. Experiments demonstrate that our method outperforms state-of-the-art event-based, frame-based, and event-frame fusion approaches, highlighting the effectiveness of integrating the event and frame domains for nighttime traffic object detection.
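To make the channel-dimension selection-and-fusion idea concrete, below is a minimal PyTorch sketch of what such a module might look like. The module name `AdaptiveSelectFuse`, the gating MLP, and the top-k channel selection are our assumptions for illustration; the abstract only states that features are ranked and fused along the channel dimension, so this is not the authors' exact design.

```python
import torch
import torch.nn as nn


class AdaptiveSelectFuse(nn.Module):
    """Hypothetical channel-wise adaptive selection and fusion.

    Scores each channel of the concatenated RGB/event features with a
    lightweight gating branch, ranks the channels by score, keeps the
    top C of the 2C channels, and projects back with a 1x1 conv.
    Illustrative sketch only, not the paper's exact module.
    """

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        # gating MLP producing one importance score per channel
        self.gate = nn.Sequential(
            nn.Linear(2 * channels, 2 * channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(2 * channels // reduction, 2 * channels),
            nn.Sigmoid(),
        )
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, rgb: torch.Tensor, evt: torch.Tensor) -> torch.Tensor:
        x = torch.cat([rgb, evt], dim=1)               # (B, 2C, H, W)
        b, c2, h, w = x.shape
        scores = self.gate(self.pool(x).flatten(1))    # (B, 2C) channel scores
        x = x * scores.unsqueeze(-1).unsqueeze(-1)     # reweight channels
        # rank channels by score and keep the top C ("selection" step)
        topk = scores.topk(k=c2 // 2, dim=1).indices   # (B, C)
        idx = topk.unsqueeze(-1).unsqueeze(-1).expand(-1, -1, h, w)
        selected = torch.gather(x, 1, idx)             # (B, C, H, W)
        return self.proj(selected)                     # fused feature map


# Usage: fuse same-shape RGB and event feature maps at one pyramid level.
fuse = AdaptiveSelectFuse(channels=256)
rgb_feat = torch.randn(2, 256, 32, 32)
evt_feat = torch.randn(2, 256, 32, 32)
fused = fuse(rgb_feat, evt_feat)                       # (2, 256, 32, 32)
```

In a sketch like this, the learned scores let the network suppress the weaker modality per channel (e.g., underexposed RGB channels at night), which is one plausible way to handle the unbalanced fusion problem the abstract describes; the fused map could then feed each level of the multimodal attention feature pyramid.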