Abstract

Due to the limitations of visible (RGB) sensors in challenging scenarios, such as nighttime and foggy environments, the thermal infrared (TIR) modality draws increasing attention as an auxiliary source for robust tracking systems. Currently, the existing methods extract both the RGB and TIR clues in a similar approach, i.e. , utilising RGB-pretrained models with or without finetuning, and then aggregate the multi-modal information through a fusion block embedded in a single level. However, the different imaging principles of RGB and TIR data raise questions about the suitability of RGB-pretrained models for thermal data. In this paper, it is argued that the modality gap is overlooked, and an alternative training paradigm is proposed for TIR data to ensure consistency between the training and test data, which is achieved by optimising the TIR feature extractor with only TIR data involved. Furthermore, with the goal of making better use of the enhanced thermal representations, a multi-level fusion strategy is inspired by the observation that various fusion strategies at different levels can contribute to a better performance. Specifically, fusion modules at both the feature and decision levels are derived for a comprehensive fusion procedure while the pixel-level fusion strategy is not considered due to the misalignment of multi-modal image pairs. The effectiveness of our method is demonstrated by extensive qualitative and quantitative experiments conducted on several challenging benchmarks. Code will be released at https://github.com/Zhangyong-Tang/MELT .

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.