Integrating complementary visual information from multimodal image pairs can significantly improve the robustness and accuracy of object detection algorithms, particularly in challenging environments. However, a key challenge lies in the effective fusion of modality-specific features within these algorithms. To address this, we propose a novel lightweight fusion module, termed the Coordinate Attention Fusion (CAF) module, built on the YOLOv5 object detection framework. The CAF module exploits differential amplification and a coordinate attention mechanism to selectively enhance distinctive cross-modal features, thereby preserving critical modality-specific information. To further reduce computational overhead, we refine the two-stream backbone network, cutting the model's parameter count without compromising accuracy. Comprehensive experiments on two benchmark multimodal datasets demonstrate that the proposed approach consistently surpasses conventional single-modality baselines and outperforms existing state-of-the-art multimodal object detection algorithms. These findings underscore the potential of cross-modality fusion as a promising direction for improving object detection in adverse conditions.
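As a rough illustration of the kind of fusion block the abstract describes, the PyTorch sketch below combines a differential (cross-modal difference) branch with coordinate attention to re-weight two modality streams. The paper's exact CAF design is not reproduced here; all class names, channel sizes, the reduction ratio, and the re-weighting scheme are assumptions made for illustration only.

```python
# Hedged sketch of a coordinate-attention-based fusion block.
# Every design detail (names, channel sizes, how the difference signal
# re-weights each stream) is an assumption, not the authors' exact CAF module.
import torch
import torch.nn as nn


class CoordinateAttention(nn.Module):
    """Coordinate attention: factorizes spatial attention into two 1-D
    encodings along the height and width directions."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        hidden = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # pool over width  -> (B, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # pool over height -> (B, C, 1, W)
        self.conv1 = nn.Conv2d(channels, hidden, kernel_size=1)
        self.bn = nn.BatchNorm2d(hidden)
        self.act = nn.SiLU()
        self.conv_h = nn.Conv2d(hidden, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(hidden, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        x_h = self.pool_h(x)                          # (B, C, H, 1)
        x_w = self.pool_w(x).permute(0, 1, 3, 2)      # (B, C, W, 1)
        y = self.act(self.bn(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                      # height attention
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # width attention
        return x * a_h * a_w


class CAFusion(nn.Module):
    """Hypothetical fusion block: the differential branch |f_rgb - f_ir| is
    passed through coordinate attention and used to amplify distinctive
    (modality-specific) features in each stream before a 1x1 fusion conv."""

    def __init__(self, channels: int):
        super().__init__()
        self.diff_attn = CoordinateAttention(channels)
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, f_rgb: torch.Tensor, f_ir: torch.Tensor) -> torch.Tensor:
        diff = self.diff_attn(torch.abs(f_rgb - f_ir))   # modality-specific cues
        f_rgb = f_rgb + f_rgb * torch.sigmoid(diff)      # amplify distinctive RGB features
        f_ir = f_ir + f_ir * torch.sigmoid(diff)         # amplify distinctive IR features
        return self.fuse(torch.cat([f_rgb, f_ir], dim=1))


if __name__ == "__main__":
    rgb, ir = torch.randn(1, 256, 40, 40), torch.randn(1, 256, 40, 40)
    print(CAFusion(256)(rgb, ir).shape)   # torch.Size([1, 256, 40, 40])
```

In a two-stream YOLOv5-style detector, a block like this would typically sit at each scale where the visible and infrared backbone feature maps are merged before the detection neck; the placement shown here is likewise an assumption.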