Abstract
Siamese tracking is one of the most promising object tracking paradigms today owing to its balance of accuracy and speed. However, it still performs poorly under challenging conditions such as low light or extreme weather. This is caused by the inherent limitations of visible images, and a common remedy is to introduce infrared data as an aid to improve tracking robustness. However, most existing RGBT trackers are variants of MDNet (H. Nam, B. Han, "Learning multi-domain convolutional neural networks for visual tracking," in: Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2016, pp. 4293–4302), which suffer from significant limitations in operational efficiency. Meanwhile, the potential of Siamese tracking in the field of RGBT tracking has not been effectively exploited because of its reliance on large-scale training data. To resolve this dilemma, in this paper we propose an end-to-end Siamese RGBT tracking framework based on cross-modal feature enhancement and self-attention (SiamFEA). Drawing on the idea of transfer learning, we employ local fine-tuning to reduce the dependence on large-scale RGBT data and verify the feasibility of this approach; we then propose a reliable fusion scheme to fuse the features of the two modalities efficiently. Specifically, we first introduce a cross-modal feature enhancement module to exploit the complementary properties of the two modalities, and then capture non-local attention along the channel and spatial dimensions for adaptive weighted fusion. Our network is trained end-to-end on the LasHeR training set (C. Li, W. Xue, Y. Jia, Z. Qu, B. Luo, J. Tang, "LasHeR: A large-scale high-diversity benchmark for RGBT tracking," CoRR abs/2104.13202, 2021) and achieves new state-of-the-art results on GTOT (C. Li, H. Cheng, S. Hu, X. Liu, J. Tang, L. Lin, "Learning collaborative sparse representation for grayscale-thermal tracking," IEEE Trans. Image Process. 25 (12) (2016) 5743–5756), RGBT234 (C. Li, X. Liang, Y. Lu, N. Zhao, J. Tang, "RGB-T object tracking: Benchmark and baseline," Pattern Recognition 96 (2019) 106977), and LasHeR, while running in real time.
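To make the fusion idea concrete, the following is a minimal sketch of a dual-modality fusion block of the kind the abstract describes: RGB and thermal features are concatenated, re-weighted by channel attention and then by spatial attention, and projected back to a single fused feature map. This is an illustration under our own assumptions, not the paper's actual SiamFEA implementation; the module name CrossModalFusion, the inputs rgb_feat and tir_feat, and all layer choices are hypothetical.

```python
# Hypothetical sketch of channel + spatial attention fusion for RGBT features.
# Not the paper's implementation; names and layer sizes are assumptions.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        # Channel attention: pool spatial dims, then predict a per-channel weight.
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, 2 * channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial attention: predict a per-location weight from pooled channel maps.
        self.spatial_att = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )
        # Project the concatenated, re-weighted features back to C channels.
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, rgb_feat: torch.Tensor, tir_feat: torch.Tensor) -> torch.Tensor:
        x = torch.cat([rgb_feat, tir_feat], dim=1)   # (B, 2C, H, W)
        x = x * self.channel_att(x)                  # adaptive channel re-weighting
        avg_map = x.mean(dim=1, keepdim=True)        # (B, 1, H, W)
        max_map = x.amax(dim=1, keepdim=True)        # (B, 1, H, W)
        x = x * self.spatial_att(torch.cat([avg_map, max_map], dim=1))
        return self.fuse(x)                          # fused feature, (B, C, H, W)


if __name__ == "__main__":
    fusion = CrossModalFusion(channels=256)
    rgb = torch.randn(1, 256, 25, 25)   # toy RGB backbone feature
    tir = torch.randn(1, 256, 25, 25)   # toy thermal backbone feature
    print(fusion(rgb, tir).shape)       # torch.Size([1, 256, 25, 25])
```

A block like this would sit between the two Siamese backbones and the tracking head, so only the fusion layers (and optionally the last backbone stages) need fine-tuning on RGBT data, which is consistent with the local fine-tuning strategy the abstract mentions.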