Multi-object tracking aims to assign a unique trajectory identity to each target across video frames. Most tracking-by-detection methods rely on Kalman filtering for trajectory prediction and directly use the features of the associated targets for trajectory updates. However, this approach often fails under camera jitter and transient target loss in real-world scenarios. To address these issues, this paper rethinks state prediction and state fusion based on targets' temporal features and proposes the SimpleTrackV2 algorithm, building on our previously designed SimpleTrack. First, to address the poor prediction performance of linear motion models in complex scenes, we designed LSTM-MP, a target state prediction algorithm based on long short-term memory (LSTM): an LSTM encodes the target's historical motion information, and a multilayer perceptron (MLP) decodes it to predict the target's state. Second, to mitigate the effect of occlusion on the saliency of target states, we designed TSA-FF, a target appearance feature fusion algorithm based on spatiotemporal attention. TSA-FF computes adaptive fusion coefficients to strengthen target state fusion, thereby improving the accuracy of subsequent data association. To demonstrate the effectiveness of the proposed method, we compared SimpleTrackV2 with the baseline SimpleTrack on the MOT17 dataset. We also conducted ablation experiments on TSA-FF and LSTM-MP, exploring the optimal number of fusion frames and the impact of different loss functions on model performance. The experimental results show that SimpleTrackV2 handles camera jitter and target occlusion better, improving MOTA, IDF1, and HOTA by 1.6%, 3.2%, and 6.1%, respectively, over SimpleTrack.
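The abstract gives only the high-level structure of LSTM-MP (LSTM encoder over a track's motion history, MLP decoder producing the predicted state). The following is a minimal PyTorch sketch of that encoder-decoder pattern, not the authors' implementation; the class name, the 4-dimensional box state, the hidden size, and the history length are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class LSTMMP(nn.Module):
    """Sketch of an LSTM-MP-style motion predictor: an LSTM encodes a
    track's recent box states and an MLP decodes the next state.
    Dimensions are assumptions, not taken from the paper."""

    def __init__(self, state_dim=4, hidden_dim=64):
        super().__init__()
        self.encoder = nn.LSTM(state_dim, hidden_dim, batch_first=True)
        self.decoder = nn.Sequential(          # MLP decoder
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, state_dim),  # predicted (cx, cy, w, h)
        )

    def forward(self, history):
        # history: (batch, T, state_dim) past box states of each track
        _, (h_n, _) = self.encoder(history)    # final hidden state summarizes the motion
        return self.decoder(h_n[-1])           # (batch, state_dim) next-state prediction


# Usage: predict the next box for two tracks from their last 8 observed states.
model = LSTMMP()
past = torch.randn(2, 8, 4)                    # dummy (cx, cy, w, h) histories
next_box = model(past)                         # shape: (2, 4)
```

Replacing the Kalman filter's constant-velocity transition with a learned recurrent predictor of this shape is what lets the tracker absorb nonlinear motion and camera jitter that a linear model cannot.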
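For TSA-FF, the abstract states only that attention-derived adaptive coefficients fuse a target's appearance features before association. The sketch below shows one way such adaptive fusion can work, attending over the temporal window only; the paper's full spatiotemporal design is not specified in the abstract, and every name and dimension here is hypothetical.

```python
import torch
import torch.nn as nn

class TSAFF(nn.Module):
    """Sketch of TSA-FF-style fusion: learned attention scores over a window
    of past appearance embeddings act as adaptive fusion coefficients.
    Temporal attention only; an assumption, not the authors' full design."""

    def __init__(self, feat_dim=128):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)  # per-frame attention score

    def forward(self, feats):
        # feats: (batch, K, feat_dim) appearance embeddings from K recent frames
        weights = torch.softmax(self.score(feats), dim=1)  # adaptive fusion coefficients
        fused = (weights * feats).sum(dim=1)               # coefficient-weighted fusion
        return nn.functional.normalize(fused, dim=-1)      # unit-norm fused embedding


# Usage: fuse a 5-frame window of embeddings per track for data association.
fusion = TSAFF()
window = torch.randn(2, 5, 128)
fused = fusion(window)            # shape: (2, 128)
```

Down-weighting occluded frames this way keeps a track's fused appearance embedding stable, which is the property the abstract credits for the improved data-association accuracy.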