A key issue in thermal infrared tracking is how to use neural networks to represent the target effectively and efficiently in the thermal infrared domain. The scarcity of thermal infrared training data makes it difficult to train a robust infrared object tracker from scratch, and time-consuming convolution operations make tracking slow. To address these problems, we propose cross-modal compression distillation, which represents thermal infrared objects for tracking by leveraging an off-the-shelf RGB model through knowledge distillation. Specifically, cross-modal distillation transfers knowledge from the RGB modality to the thermal infrared modality by feeding paired RGB and thermal infrared images into the two branches of a Siamese network. In addition, within the teacher–student architecture, the feature extractor is compressed into a lightweight model by model pruning and multi-level deep feature matching. Experimental results on the LSOTB-TIR and PTB-TIR datasets show that thermal infrared trackers distilled by the proposed method track faster and more accurately than the baseline RGB tracker, improving on it by 1.5% in Success Rate, 2.2% in Precision, 1.9% in Normalized Precision, and 58 frames per second (FPS) on the LSOTB-TIR dataset.
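The multi-level deep feature matching mentioned above can be illustrated with a minimal sketch. It assumes that the matching objective is a weighted sum of per-level mean squared errors between teacher (RGB) and student (thermal infrared) features, and that the features at each level have already been brought to the same length; the abstract does not specify the loss, so the function names, the choice of MSE, and the weights are all hypothetical.

```python
from typing import List, Optional, Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def multi_level_matching_loss(
    teacher_feats: List[Sequence[float]],
    student_feats: List[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Weighted sum of per-level MSE between teacher and student features.

    Assumption: one flattened feature vector per network level, with
    teacher and student dimensions already aligned at every level.
    """
    if len(teacher_feats) != len(student_feats):
        raise ValueError("teacher and student must expose the same number of levels")
    if weights is None:
        weights = [1.0] * len(teacher_feats)  # equal weighting by default
    if len(weights) != len(teacher_feats):
        raise ValueError("one weight is required per level")
    return sum(
        w * mse(t, s) for w, t, s in zip(weights, teacher_feats, student_feats)
    )


# Toy features for two levels: the teacher sees the RGB frame,
# the student sees the paired thermal infrared frame.
teacher = [[1.0, 2.0], [0.0, 1.0, 2.0, 3.0]]
student = [[1.0, 4.0], [0.0, 1.0, 2.0, 3.0]]
loss = multi_level_matching_loss(teacher, student)
```

In a real pruned student the channel counts would typically differ from the teacher's, so some projection would be needed before a comparison like this; that step is left out of the sketch.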