Abstract

Fusing the complementary information of the visible and infrared modalities can improve object detection performance for unmanned aerial vehicle (UAV) remote sensing images under insufficient illumination conditions. Although previous works have studied this field, they have rarely considered the adaptive ability of multimodal feature fusion, which limits the room for performance improvement in multimodal detectors. To this end, we propose an adaptive multimodal feature fusion method with a frequency domain gate based on DINO (detection transformer with improved denoising anchor boxes), called multimodal DINO. In our approach, a multimodal feature encoder with underlying feature sharing is designed, which efficiently extracts common and differential features through RGB-guided infrared data transformation. Additionally, an adaptive frequency domain gate is introduced to dynamically learn the degree of dependence on the frequency-filtered features of each modality when processing different samples. We evaluate the proposed method on two multimodal remote sensing object detection datasets, VEDAI and DroneVehicle. Extensive experiments demonstrate that our approach achieves superior performance compared to baseline detectors and existing multimodal detection methods. Our code is available at https://github.com/cq100/multimodalDINO.
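To make the frequency-domain gating idea concrete, the following is a minimal NumPy sketch of gated fusion of two modality feature maps in the frequency domain. It is an illustration only, not the paper's method: the gate weights here are derived from pooled spectral energy with a softmax, whereas the paper learns the gate end-to-end, and the low-pass mask shape and `cutoff` parameter are assumptions for this example.

```python
import numpy as np

def frequency_gate_fuse(rgb_feat, ir_feat, cutoff=0.25):
    """Fuse RGB and IR feature maps with a frequency-domain gate.

    Hypothetical sketch: the gate is computed from pooled low-frequency
    spectral energy rather than learned, unlike the paper's trained gate.
    """
    H, W = rgb_feat.shape
    # Move both feature maps to the frequency domain (DC centered).
    F_rgb = np.fft.fftshift(np.fft.fft2(rgb_feat))
    F_ir = np.fft.fftshift(np.fft.fft2(ir_feat))

    # Low-frequency mask: centered box covering `cutoff` of each axis.
    mask = np.zeros((H, W))
    h, w = int(H * cutoff), int(W * cutoff)
    mask[H // 2 - h : H // 2 + h, W // 2 - w : W // 2 + w] = 1.0

    # Frequency-filtered components per modality.
    low_rgb, low_ir = F_rgb * mask, F_ir * mask

    # Sample-adaptive gate: softmax over pooled spectral energies, so the
    # weight on each modality varies with the input sample.
    e = np.array([np.abs(low_rgb).mean(), np.abs(low_ir).mean()])
    g = np.exp(e - e.max())
    g = g / g.sum()

    # Gated fusion in the frequency domain, then back to spatial domain.
    fused = g[0] * F_rgb + g[1] * F_ir
    return np.real(np.fft.ifft2(np.fft.ifftshift(fused)))

# Toy usage with random single-channel feature maps.
rgb = np.random.rand(16, 16)
ir = np.random.rand(16, 16)
out = frequency_gate_fuse(rgb, ir)
```

Because the gate depends on the spectral content of each input pair, dark scenes where the IR map carries more low-frequency energy would automatically receive a larger IR weight, which is the adaptivity the abstract refers to.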
