Abstract
Natural language tracking aims to localize, with a sequence of bounding boxes across video frames, the target object referred to by a language description. Compared with traditional visual single-object tracking initialized with only a bounding box (BBox), this task introduces high-level semantic information that reduces the ambiguity of the BBox and enhances the ability to retrieve the target in a global manner, thus yielding more accurate and robust tracking results. Previous methods usually adopt off-the-shelf grounding and tracking branches to tackle this task, where feature representations are learned in isolation without benefiting each other. However, since the language description and the template image provide information from different sources, the two branches can be associated with each other to discover crucial clues. Therefore, we propose a unified Transformer method for natural language tracking, named TransNLT, which utilizes isomorphic Transformer structures for the grounding and tracking branches and enables collaborative learning between them to construct comprehensive features of the target. In addition, we propose a Selective Feature Gathering (SFG) module that integrates cross-modal global information about the tracking target from the visual template and the language description. Through this effective interaction of visual and language information, we achieve better results than tracking with a single modality alone. Extensive experiments on three popular natural language tracking benchmarks show that the proposed TransNLT outperforms previous state-of-the-art methods.
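The abstract describes the SFG module only at a high level. As a rough illustration of the general idea of gathering cross-modal information from a visual template and a language description, the following is a minimal, hypothetical PyTorch sketch; the class name, gating mechanism, and dimensions are assumptions for illustration and are not the paper's actual design.

```python
# Hypothetical sketch of cross-modal feature gathering via cross-attention.
# All names and design choices here are illustrative assumptions, not TransNLT's implementation.
import torch
import torch.nn as nn


class SelectiveFeatureGathering(nn.Module):
    """Fuse visual-template tokens with language tokens via cross-attention,
    then gate the attended features against the original visual tokens."""

    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())
        self.norm = nn.LayerNorm(d_model)

    def forward(self, vis_tokens: torch.Tensor, lang_tokens: torch.Tensor) -> torch.Tensor:
        # vis_tokens:  (B, Nv, d_model) visual template features
        # lang_tokens: (B, Nl, d_model) language description features
        attended, _ = self.cross_attn(query=vis_tokens, key=lang_tokens, value=lang_tokens)
        g = self.gate(torch.cat([vis_tokens, attended], dim=-1))  # per-token selection weights
        return self.norm(vis_tokens + g * attended)               # gated cross-modal fusion


if __name__ == "__main__":
    # Toy usage with arbitrary shapes
    sfg = SelectiveFeatureGathering()
    vis = torch.randn(2, 64, 256)    # e.g. 8x8 template patches
    lang = torch.randn(2, 20, 256)   # e.g. 20 word embeddings
    fused = sfg(vis, lang)
    print(fused.shape)               # torch.Size([2, 64, 256])
```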