Abstract

While RGB-based trackers have made impressive progress, they still falter in complex scenarios, necessitating multi-modal tracking strategies that leverage auxiliary modalities. However, most existing methods do not sufficiently explore and exchange complementary information within and between modalities. To address this, we propose a Modality Interaction Network (MINet), a unified framework for multi-modal tracking. It consists of a Modality Representation Module (MRM) and a Memory Query Module (MQM). MRM enables communication between modalities through a Modality Interaction Module (MIM) and fuses multi-modal information with a Modality Fuse Module (MFM) to generate more discriminative representations. MQM maintains historical multi-modal information and builds long-range dependencies between the current and historical targets, which improves robustness, especially when targets undergo significant deformation or occlusion. To verify effectiveness across different multi-modal tracking paradigms, we conduct extensive experiments on RGB-D, RGB-T, and RGB-E benchmarks. The results demonstrate that the proposed MINet achieves outstanding performance compared to state-of-the-art trackers on these multi-modal tracking tasks, surpassing them by 1% in RGB-D, 1.2% in RGB-T, and 1% in RGB-E tracking performance, respectively.
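To make the described pipeline concrete, below is a minimal PyTorch-style sketch of how the abstract's components (MIM, MFM, MQM) might compose. It is not the authors' implementation: the cross-attention design, the concatenation-based fusion, the memory-bank size, the class names, and the token shapes are all assumptions made for illustration only; the backbone and the tracking head are omitted.

```python
import torch
import torch.nn as nn


class ModalityInteractionModule(nn.Module):
    """Hypothetical MIM: bidirectional cross-attention between RGB and auxiliary tokens."""

    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.rgb_to_aux = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.aux_to_rgb = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_rgb = nn.LayerNorm(dim)
        self.norm_aux = nn.LayerNorm(dim)

    def forward(self, rgb_tokens, aux_tokens):
        # Each modality queries the other so complementary cues flow both ways.
        rgb_out, _ = self.aux_to_rgb(rgb_tokens, aux_tokens, aux_tokens)
        aux_out, _ = self.rgb_to_aux(aux_tokens, rgb_tokens, rgb_tokens)
        return self.norm_rgb(rgb_tokens + rgb_out), self.norm_aux(aux_tokens + aux_out)


class ModalityFuseModule(nn.Module):
    """Hypothetical MFM: merge the two token streams into one representation."""

    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, rgb_tokens, aux_tokens):
        return self.proj(torch.cat([rgb_tokens, aux_tokens], dim=-1))


class MemoryQueryModule(nn.Module):
    """Hypothetical MQM: current fused tokens attend to a bank of historical tokens."""

    def __init__(self, dim=256, num_heads=8, memory_size=8):
        super().__init__()
        self.memory_size = memory_size
        self.memory = []  # past fused token maps, stored detached
        self.query_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, fused_tokens):
        if self.memory:
            mem = torch.cat(self.memory, dim=1)  # concatenate along the token axis
            out, _ = self.query_attn(fused_tokens, mem, mem)
            fused_tokens = self.norm(fused_tokens + out)
        # Update the memory bank with the current frame's representation.
        self.memory.append(fused_tokens.detach())
        self.memory = self.memory[-self.memory_size:]
        return fused_tokens


class MINetSketch(nn.Module):
    """Toy composition of MRM (MIM + MFM) and MQM; backbone and head omitted."""

    def __init__(self, dim=256):
        super().__init__()
        self.mim = ModalityInteractionModule(dim)
        self.mfm = ModalityFuseModule(dim)
        self.mqm = MemoryQueryModule(dim)

    def forward(self, rgb_tokens, aux_tokens):
        rgb_tokens, aux_tokens = self.mim(rgb_tokens, aux_tokens)
        fused = self.mfm(rgb_tokens, aux_tokens)
        return self.mqm(fused)


# Usage: tokens from an RGB frame and an aligned auxiliary frame (depth/thermal/event).
model = MINetSketch(dim=256)
rgb = torch.randn(1, 196, 256)      # (batch, tokens, dim)
aux = torch.randn(1, 196, 256)
features = model(rgb, aux)          # (1, 196, 256), memory-conditioned fused features
```

In this sketch, the memory bank is what lets the tracker relate the current frame to earlier appearances of the target; how MINet actually stores and queries historical information is specified in the paper, not here.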
