RGB-Infrared multi-modal object detection exploits diverse and complementary information from the two modalities, offering advantages in the intelligent transportation field. The main challenge of RGB-Infrared object detection is how to fuse the two modalities. The difficulty of fusion lies in two aspects: 1) large visual differences between the modalities make it difficult to learn effective complementary features, and 2) misaligned RGB-Infrared image pairs further increase the difficulty of fusion. To this end, building on the feature pyramid commonly used in object detection, we propose the Multi-modal Feature Pyramid Transformer (MFPT) to fuse the two modalities. MFPT learns semantic and cross-modal complementary information to enhance the features of each modality via an intra-modal feature pyramid transformer and an inter-modal feature pyramid transformer. The intra-modal feature pyramid transformer enables features to interact across space and scales, improving the semantic representations within each modality. The inter-modal feature pyramid transformer conducts feature interaction between modalities, enabling each modality to learn complementary features from the other. Moreover, the inter-modal feature pyramid transformer learns distance-independent dependencies between modalities, which are insensitive to misaligned images. Furthermore, a window-based local attention mechanism is introduced into MFPT to efficiently correlate regions across different scales or modalities. Experimental results on two RGB-Infrared detection datasets demonstrate that the proposed method outperforms state-of-the-art methods.
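To make the fusion scheme concrete, the following is a minimal, hypothetical PyTorch sketch of the intra-modal (cross-scale) and inter-modal (cross-modality) attention stages described above. It is not the authors' implementation: the paper's window-based local attention is replaced by ordinary global multi-head attention for brevity, and all class and argument names are illustrative.

```python
# Minimal sketch of the MFPT fusion idea (illustrative, not the authors' code).
# Assumptions: feature pyramids are lists of NCHW tensors sharing one channel
# dimension; global multi-head attention stands in for the paper's
# window-based local attention.
import torch
import torch.nn as nn


class MultiModalFeaturePyramidFusion(nn.Module):
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        # Intra-modal attention: tokens from all pyramid levels of one modality
        # attend to each other, i.e. interaction across space and scales.
        self.intra_rgb = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.intra_ir = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Inter-modal attention: each modality queries the other, learning
        # complementary, distance-independent dependencies between modalities.
        self.rgb_from_ir = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ir_from_rgb = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    @staticmethod
    def _flatten(pyramid):
        # Turn a list of NCHW maps into one token sequence, remembering shapes.
        shapes = [f.shape[-2:] for f in pyramid]
        tokens = torch.cat([f.flatten(2).transpose(1, 2) for f in pyramid], dim=1)
        return tokens, shapes

    @staticmethod
    def _unflatten(tokens, shapes):
        # Split the token sequence back into per-level NCHW maps.
        outs, start = [], 0
        for h, w in shapes:
            chunk = tokens[:, start:start + h * w]
            outs.append(chunk.transpose(1, 2).reshape(tokens.size(0), -1, h, w))
            start += h * w
        return outs

    def forward(self, rgb_pyramid, ir_pyramid):
        rgb, shapes = self._flatten(rgb_pyramid)
        ir, _ = self._flatten(ir_pyramid)
        # Intra-modal stage: self-attention across space and scales.
        rgb = rgb + self.intra_rgb(rgb, rgb, rgb)[0]
        ir = ir + self.intra_ir(ir, ir, ir)[0]
        # Inter-modal stage: cross-modal attention, which does not rely on
        # pixel-level alignment between the RGB and infrared images.
        rgb_out = rgb + self.rgb_from_ir(rgb, ir, ir)[0]
        ir_out = ir + self.ir_from_rgb(ir, rgb, rgb)[0]
        return self._unflatten(rgb_out, shapes), self._unflatten(ir_out, shapes)


if __name__ == "__main__":
    # Toy three-level pyramids (batch 2, 256 channels).
    rgb = [torch.randn(2, 256, s, s) for s in (32, 16, 8)]
    ir = [torch.randn(2, 256, s, s) for s in (32, 16, 8)]
    fused_rgb, fused_ir = MultiModalFeaturePyramidFusion()(rgb, ir)
    print([f.shape for f in fused_rgb])
```

The residual cross-attention in the inter-modal stage illustrates why such a design can tolerate misaligned inputs: each query location attends to the whole other-modality feature pyramid rather than to a fixed spatial offset.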