Abstract

Natural language tracking aims to localize the target object referred to by a language description, predicting a sequence of bounding boxes across video frames. Compared with traditional visual single object tracking initialized with only a bounding box (BBox), this task introduces high-level semantic information that reduces the ambiguity of the BBox and enhances the ability to retrieve the target in a global manner, yielding more accurate and robust tracking results. Previous methods usually adopt off-the-shelf grounding and tracking branches to tackle this task, where feature representations are learned in isolation without benefiting each other. Since the language description and the template image provide information from different sources, the two branches can cooperate to discover crucial clues. Therefore, we propose a unified Transformer method for natural language tracking named TransNLT, which utilizes isomorphic Transformer structures for the grounding and tracking branches, enabling collaborative learning to construct comprehensive features for the target. In addition, we propose a Selective Feature Gathering (SFG) module, which integrates cross-modal global information about the tracking target from the visual template and the language description. Through effective interaction of visual and language information, we achieve better results than tracking with only a single modality. Extensive experiments on three popular natural language tracking benchmarks show that our proposed TransNLT outperforms previous state-of-the-art methods.
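The abstract does not specify how the SFG module fuses the two modalities. As a hedged illustration only, one common way to integrate a visual-template feature with a language feature is a learned per-dimension gate over the concatenated features; the function name, the gating scheme, and all shapes below are assumptions for this sketch, not the paper's actual SFG design:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_cross_modal_fusion(vis_feat, lang_feat, w, b):
    """Hypothetical sketch of cross-modal fusion (NOT the paper's SFG).

    A gate in (0, 1) is computed from the concatenated visual and
    language features, then used to blend the two modalities
    dimension-wise.
    """
    gate = sigmoid(np.concatenate([vis_feat, lang_feat]) @ w + b)
    return gate * vis_feat + (1.0 - gate) * lang_feat

# Toy usage with random features and weights.
rng = np.random.default_rng(0)
d = 8
vis = rng.standard_normal(d)    # visual-template feature (assumed shape)
lang = rng.standard_normal(d)   # language-description feature (assumed shape)
w = rng.standard_normal((2 * d, d)) * 0.1
b = np.zeros(d)
fused = gated_cross_modal_fusion(vis, lang, w, b)
print(fused.shape)
```

Because the gate is bounded in (0, 1), the fused feature is a convex combination of the two modal features in every dimension, so neither modality can be entirely discarded.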
