Abstract

RGB-T tracking can be viewed as multi-view fusion tracking. In this study, we propose a transformer-based network, the Multi-Modal Mutual Propagation Tracker (MMMPT). To obtain a robust appearance model from multi-modal data, we adopt an encoder–decoder architecture to extract information. In the encoding stage, the template features of multiple frames reinforce the features they share through a self-attention mechanism, yielding a time-invariant target representation. At the same time, the two modalities interact through cross-modal propagation, yielding a modal-invariant representation of the target. The transformer decoder then transfers useful information from the template to the search region through a similarity matrix. We evaluate the tracker on the RGBT234, GTOT, VTUAV, and LasHeR datasets. Extensive experiments indicate that the proposed framework matches state-of-the-art trackers in robustness and accuracy.
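The abstract describes three attention stages: self-attention across multi-frame template features, cross-modal propagation between the RGB and thermal streams, and a decoder that transfers template information to the search region via a similarity (attention) matrix. Below is a minimal PyTorch sketch of that scheme under standard multi-head attention; all module and variable names (e.g. MutualPropagationEncoder, TemplateToSearchDecoder) are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of the fusion scheme sketched in the abstract.
# Names and layer choices are assumptions; the paper's design may differ.
import torch
import torch.nn as nn


class MutualPropagationEncoder(nn.Module):
    """Reinforces multi-frame template features (self-attention) and
    exchanges information between RGB and thermal streams (cross-attention)."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_modal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, rgb_tmpl, tir_tmpl):
        # rgb_tmpl, tir_tmpl: (B, T*N, C) tokens from T template frames.
        # Self-attention over all frames' tokens strengthens features
        # shared across time (time-invariant representation).
        rgb = self.norm1(rgb_tmpl + self.temporal_attn(rgb_tmpl, rgb_tmpl, rgb_tmpl)[0])
        tir = self.norm1(tir_tmpl + self.temporal_attn(tir_tmpl, tir_tmpl, tir_tmpl)[0])
        # Cross-modal propagation: each stream queries the other modality
        # (modal-invariant representation).
        rgb2 = self.norm2(rgb + self.cross_modal_attn(rgb, tir, tir)[0])
        tir2 = self.norm2(tir + self.cross_modal_attn(tir, rgb, rgb)[0])
        return rgb2, tir2


class TemplateToSearchDecoder(nn.Module):
    """Propagates template information into the search region: the attention
    weights between search queries and template keys play the role of the
    similarity matrix mentioned in the abstract."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, search, template):
        # search: (B, M, C) search-region tokens; template: (B, K, C).
        return self.norm(search + self.attn(search, template, template)[0])


if __name__ == "__main__":
    B, T, N, M, C = 2, 3, 64, 256, 256
    enc = MutualPropagationEncoder(C)
    dec = TemplateToSearchDecoder(C)
    rgb_t = torch.randn(B, T * N, C)   # RGB template tokens from T frames
    tir_t = torch.randn(B, T * N, C)   # thermal template tokens
    search = torch.randn(B, M, C)      # search-region tokens
    rgb_e, tir_e = enc(rgb_t, tir_t)
    out = dec(search, torch.cat([rgb_e, tir_e], dim=1))
    print(out.shape)                   # torch.Size([2, 256, 256])
```

The residual-plus-LayerNorm pattern and shared attention weights across the two streams are design assumptions made to keep the sketch compact; the published architecture likely uses stacked layers and feed-forward sublayers as in a standard transformer.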
