Abstract

Multi-Object Tracking (MOT) is a cornerstone operator for video surveillance applications. To enable real-time processing of large-scale live video streams, we study down-sampled MOT, a setting in which object tracking is performed on only a small subset of video frames. This setting is challenging for state-of-the-art MOT methods, whose performance degrades significantly at high frame reduction ratios. In this paper, we devise a sampling-resilient tracker built on a novel sparse-observation Kalman filter (SOKF). SOKF integrates an LSTM network to capture the non-linear and dynamic motion patterns caused by sparse observations. Because the LSTM-based state transition is incompatible with the original noise estimation mechanism, we propose new estimation strategies based on Bayesian neural networks and derive the optimal Kalman gain for SOKF. To associate detected bounding boxes robustly, we also propose a comprehensive similarity metric that systematically integrates multiple spatial matching signals. Experiments on three benchmark datasets show that our proposed tracker achieves the best trade-off between efficiency and accuracy. At the same tracking accuracy, we reduce the total processing time of ByteTrack by 2× on MOT17 and 3× on DanceTrack.
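To make the down-sampled setting concrete, the following is a minimal sketch of the baseline that SOKF departs from: a textbook linear constant-velocity Kalman filter that only receives observations on every `stride`-th frame. It is not the paper's SOKF (there is no LSTM state transition and no Bayesian noise estimation here). The names `downsample_indices`, `ConstantVelocityKF1D`, and `track_downsampled`, the 1-D state, and the noise values `p`, `q`, `r` are all illustrative assumptions.

```python
def downsample_indices(num_frames, stride):
    """Indices of the frames that are actually processed."""
    return list(range(0, num_frames, stride))


class ConstantVelocityKF1D:
    """Textbook linear Kalman filter over the state (position, velocity).

    Assumption: a 1-D stand-in for one bounding-box coordinate.
    """

    def __init__(self, pos, vel=0.0, p=10.0, q=0.01, r=1.0):
        self.x = [pos, vel]               # state estimate
        self.P = [[p, 0.0], [0.0, p]]     # state covariance
        self.q = q                        # process noise scale (assumed)
        self.r = r                        # measurement noise variance (assumed)

    def predict(self, dt):
        """Propagate the state across a gap of dt frames with F = [[1, dt], [0, 1]]."""
        pos, vel = self.x
        self.x = [pos + dt * vel, vel]
        (a, b), (c, d) = self.P
        # P <- F P F^T + Q, with Q = q * dt * I as a simple assumption
        self.P = [
            [a + dt * (b + c) + dt * dt * d + self.q * dt, b + dt * d],
            [c + dt * d, d + self.q * dt],
        ]

    def update(self, z):
        """Correct the state with a position measurement z (H = [1, 0])."""
        (a, b), (c, d) = self.P
        s = a + self.r                    # innovation variance
        k0, k1 = a / s, c / s             # Kalman gain
        y = z - self.x[0]                 # innovation
        self.x = [self.x[0] + k0 * y, self.x[1] + k1 * y]
        # P <- (I - K H) P
        self.P = [
            [(1 - k0) * a, (1 - k0) * b],
            [c - k1 * a, d - k1 * b],
        ]


def track_downsampled(positions, stride):
    """Run the filter on every stride-th frame of a per-frame position list."""
    frames = downsample_indices(len(positions), stride)
    kf = ConstantVelocityKF1D(pos=positions[frames[0]])
    for prev, cur in zip(frames, frames[1:]):
        kf.predict(dt=cur - prev)         # one prediction spans the skipped frames
        kf.update(positions[cur])
    return kf, frames


# Hypothetical object moving at a constant 2 px/frame, observed every 4th frame.
positions = [2.0 * t for t in range(100)]
kf, frames = track_downsampled(positions, stride=4)
print(frames[-1], kf.x)
```

With linear motion, this baseline recovers the velocity even across gaps. The abstract's point is that sparse observations make real motion look non-linear between processed frames, which is the case the LSTM-based transition in SOKF is meant to handle.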
