Abstract
Tracking pedestrians in crowded scenes is a challenging task. Existing transformer-based tracking methods integrate detection and tracking into a unified model, which simplifies the tracking pipeline. However, these methods also introduce complicated attention mechanisms that increase model complexity and computational cost. To address this issue, we propose SOTTrack, a simple online transformer-based method for crowd tracking. Our method improves feature learning and the training strategy without sacrificing simplicity or efficiency. Specifically, we introduce the Sequential Feature Aggregation (SFA) module and the Hybrid Group Training (HGT) approach. The SFA module fuses features from sequential images to improve the temporal consistency of visual features within short time intervals. The HGT approach assigns different queries to multiple guided tasks, such as label assignment and de-noising; these tasks are used only during training and add no inference cost. We evaluate our method on the MOT17 and MOT20 datasets and demonstrate competitive performance.
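The abstract does not specify how the SFA module fuses features, so the following is only a minimal sketch of the general idea of aggregating features over a short window of consecutive frames. The function name `aggregate_sequential_features`, the `decay` parameter, and the exponentially decaying weighted average are all assumptions chosen for illustration, not the authors' design.

```python
from typing import List, Sequence


def aggregate_sequential_features(
    frames: Sequence[Sequence[float]], decay: float = 0.5
) -> List[float]:
    """Fuse feature vectors from consecutive frames (oldest first).

    Hypothetical stand-in for temporal fusion: a normalized, exponentially
    decaying weighted average, so the newest frame contributes the most.
    This is NOT the SFA module from the paper, only a generic illustration.
    """
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    if any(len(f) != dim for f in frames):
        raise ValueError("all frames must share the same feature dimension")
    if not 0.0 < decay <= 1.0:
        raise ValueError("decay must be in (0, 1]")

    n = len(frames)
    # Weight for frame i (0 = oldest) is decay ** age, where age = n - 1 - i.
    weights = [decay ** (n - 1 - i) for i in range(n)]
    total = sum(weights)
    # Weighted average per feature dimension.
    return [
        sum(w * f[d] for w, f in zip(weights, frames)) / total
        for d in range(dim)
    ]


if __name__ == "__main__":
    # Three frames, oldest first; the second feature is constant over time.
    window = [[0.0, 4.0], [2.0, 4.0], [4.0, 4.0]]
    print(aggregate_sequential_features(window, decay=0.5))
```

With `decay=1.0` the sketch reduces to a plain mean over the window; smaller values bias the fused feature toward the most recent frame, which is one simple way to smooth features over a short interval while still following fast motion.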
More From: Journal of Visual Communication and Image Representation