Abstract

Natural language tracking aims to localize the target object referred to by a language description, predicting a sequence of bounding boxes across video frames. Compared with traditional visual single object tracking initialized with only a bounding box (BBox), this task introduces high-level semantic information that reduces the ambiguity of the BBox and enhances the ability to retrieve the target in a global manner, yielding more accurate and robust tracking results. Previous methods usually adopt off-the-shelf grounding and tracking branches to tackle this task, where feature representations are learned in isolation without benefiting each other. Since the language description and the template image provide information from different sources, the two branches can cooperate to discover crucial clues. Therefore, we propose a unified Transformer method for natural language tracking named TransNLT, which utilizes isomorphic Transformer structures for the grounding and tracking branches, enabling collaborative learning to construct comprehensive features for the target. In addition, we propose a Selective Feature Gathering (SFG) module, which integrates cross-modal global information about the tracking target from the visual template and the language description. Through effective interaction of visual and language information, we achieve better results than tracking with only a single modality. Extensive experiments on three popular natural language tracking benchmarks show that our proposed TransNLT outperforms previous state-of-the-art methods.
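The abstract does not specify how the SFG module fuses the two modalities. As a hedged illustration only, one common way to integrate a visual-template feature with a language feature is a learned per-dimension gate over the concatenated features; the function name, the gating scheme, and all shapes below are assumptions for this sketch, not the paper's actual SFG design:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_cross_modal_fusion(vis_feat, lang_feat, w, b):
    """Hypothetical sketch of cross-modal fusion (NOT the paper's SFG).

    A gate in (0, 1) is computed from the concatenated visual and
    language features, then used to blend the two modalities
    dimension-wise.
    """
    gate = sigmoid(np.concatenate([vis_feat, lang_feat]) @ w + b)
    return gate * vis_feat + (1.0 - gate) * lang_feat

# Toy usage with random features and weights.
rng = np.random.default_rng(0)
d = 8
vis = rng.standard_normal(d)    # visual-template feature (assumed shape)
lang = rng.standard_normal(d)   # language-description feature (assumed shape)
w = rng.standard_normal((2 * d, d)) * 0.1
b = np.zeros(d)
fused = gated_cross_modal_fusion(vis, lang, w, b)
print(fused.shape)
```

Because the gate is bounded in (0, 1), the fused feature is a convex combination of the two modal features in every dimension, so neither modality can be entirely discarded.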
