Abstract

Hierarchical deep features provide multilevel abstractions of target objects, which play an important role in target localization and classification. Effectively aggregating abstract information from different levels of the RGB and thermal modalities is therefore key to exploiting their complementary advantages for robust RGBT tracking. However, existing RGBT tracking algorithms either focus only on the semantic information of the last layer or aggregate hierarchical deep features from each modality using simple operations (e.g., summation or concatenation), which limits the capability of the multimodal tracker. To address these issues, we propose a novel multimodal cross-layer bilinear pooling network for RGBT tracking. In our network, we first apply a channel attention mechanism to adaptively recalibrate the feature channels of all convolutional layers before performing hierarchical feature fusion. Then, a bilinear pooling operation is applied to any two layers via the outer product, a second-order computation that effectively aggregates the deep semantic and shallow texture information of the target. Finally, a quality-aware fusion module adaptively aggregates the cross-layer bilinear pooling features of the two modalities. Extensive experiments on two public benchmark datasets demonstrate the effectiveness of our tracker against other state-of-the-art tracking methods.
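To make the three components concrete, the following is a minimal PyTorch sketch of channel attention, cross-layer bilinear pooling, and quality-aware fusion as described above. All module names, tensor shapes, and design details (SE-style attention, signed-square-root normalization, softmax quality weights) are illustrative assumptions for exposition, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    # SE-style recalibration of feature channels before hierarchical fusion.
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                         # x: (B, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))           # global average pooling -> (B, C)
        return x * w.unsqueeze(-1).unsqueeze(-1)  # per-channel reweighting

class CrossLayerBilinearPooling(nn.Module):
    # Second-order aggregation of two layers via per-location outer products.
    def forward(self, x1, x2):                    # x1: (B, C1, H1, W1), x2: (B, C2, H2, W2)
        if x1.shape[-2:] != x2.shape[-2:]:        # match spatial resolutions first
            x2 = F.interpolate(x2, size=x1.shape[-2:],
                               mode='bilinear', align_corners=False)
        h, w = x1.shape[-2:]
        # average of outer products over all spatial locations: (B, C1, C2)
        z = torch.einsum('bchw,bdhw->bcd', x1, x2) / (h * w)
        z = z.flatten(1)                          # (B, C1*C2)
        z = torch.sign(z) * torch.sqrt(z.abs() + 1e-8)  # signed square root
        return F.normalize(z, dim=1)              # L2 normalization

class QualityAwareFusion(nn.Module):
    # Predicts a quality score per branch and fuses by softmax-weighted sum.
    def __init__(self, dim):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)

    def forward(self, feats):                     # list of (B, D) bilinear features
        scores = torch.cat([self.scorer(f) for f in feats], dim=1)  # (B, K)
        w = torch.softmax(scores, dim=1)          # adaptive per-branch weights
        return sum(w[:, i:i + 1] * f for i, f in enumerate(feats))

# Toy usage with assumed VGG-like feature shapes for the RGB and thermal streams
# (attention modules are shared here for brevity; in practice each modality
# would typically have its own).
rgb_l2, rgb_l3 = torch.randn(2, 256, 28, 28), torch.randn(2, 512, 14, 14)
t_l2, t_l3 = torch.randn(2, 256, 28, 28), torch.randn(2, 512, 14, 14)

ca2, ca3 = ChannelAttention(256), ChannelAttention(512)
blp = CrossLayerBilinearPooling()
fuse = QualityAwareFusion(256 * 512)

f_rgb = blp(ca2(rgb_l2), ca3(rgb_l3))             # (2, 131072)
f_t = blp(ca2(t_l2), ca3(t_l3))
fused = fuse([f_rgb, f_t])                        # quality-weighted multimodal feature
print(fused.shape)                                # torch.Size([2, 131072])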
