Abstract

Multiple-object tracking (MOT) has long been a topic of interest because it plays an important role in many computer vision applications. Existing works are mostly designed for outdoor tracking, such as video surveillance and autonomous driving. However, the behaviors of objects in outdoor scenarios do not fully reflect the tracking challenges of indoor environments. In outdoor scenarios, pedestrians and vehicles usually move uniformly from place to place along a simple straight path, and targets usually differ in appearance. In contrast, in indoor scenarios such as choreographed performances, the dynamic behaviors of dancers lead to severe occlusions, and similar costumes create a homogeneous appearance problem. These two problems noticeably degrade the performance of existing works in indoor tracking. In this paper, we propose a depth-enhanced tracking-by-detection framework and a semantic matching strategy combined with a scene-aware affinity measurement method to significantly mitigate the occlusion and homogeneous appearance problems. In addition, we introduce an indoor tracking dataset that increases the diversity of existing benchmark datasets for indoor tracking evaluation. We conduct experiments on both the proposed indoor tracking dataset and the latest MOT benchmarks, MOT17 and MOT20. The experimental results show that our method consistently outperforms other works on the HOTA metric across the benchmarks and reduces the number of identity switches by 20% relative to the second-best tracker, DeepSORT, on our proposed indoor MOT benchmark dataset.

Highlights

  • The goal of multiple-object tracking (MOT) is to consistently localize and identify several objects in a video sequence

  • We present an indoor tracking dataset called NTU-MOTD that increases the diversity of existing MOT benchmark datasets and better reveals the tracking challenges in indoor tracking environments

  • We propose a depth-enhanced tracker (DET) incorporating a depth estimation module into the typical tracking-by-detection framework to solve the severe occlusion problem effectively, and we introduce the semantic matching strategy combined with the scene-aware affinity measurement method to make the critical association process robust to different tracking scenarios, such as outdoor scenes and indoor scenes in which objects have similar appearances
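To make the idea of using depth during association concrete, the following is a minimal sketch and not the DET implementation described in the paper. It assumes each box carries a hypothetical per-object depth estimate, scores track/detection pairs by IoU weighted by depth similarity, and uses simple greedy matching in place of whatever assignment method the authors use. The names `Box`, `affinity`, `associate`, `depth_scale`, and `min_affinity` are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Box:
    x1: float
    y1: float
    x2: float
    y2: float
    depth: float  # assumed per-object depth estimate (hypothetical units)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a.x2, b.x2) - max(a.x1, b.x1))
    ih = max(0.0, min(a.y2, b.y2) - max(a.y1, b.y1))
    inter = iw * ih
    union = (a.x2 - a.x1) * (a.y2 - a.y1) + (b.x2 - b.x1) * (b.y2 - b.y1) - inter
    return inter / union if union > 0 else 0.0


def affinity(track: Box, det: Box, depth_scale: float = 1.0) -> float:
    """Assumed affinity: IoU down-weighted by the depth gap between the boxes."""
    depth_sim = 1.0 / (1.0 + abs(track.depth - det.depth) / depth_scale)
    return iou(track, det) * depth_sim


def associate(tracks, dets, min_affinity=0.1):
    """Greedy one-to-one matching, highest affinity first (a simplification)."""
    scored = sorted(
        ((affinity(t, d), ti, di)
         for ti, t in enumerate(tracks)
         for di, d in enumerate(dets)),
        reverse=True,
    )
    used_tracks, used_dets, matches = set(), set(), []
    for score, ti, di in scored:
        if score < min_affinity:
            break  # remaining pairs are all weaker than the threshold
        if ti in used_tracks or di in used_dets:
            continue
        used_tracks.add(ti)
        used_dets.add(di)
        matches.append((ti, di))
    return sorted(matches)


# Two tracks that overlap heavily in the image but stand at different depths.
tracks = [Box(0, 0, 10, 20, depth=2.0), Box(2, 0, 12, 20, depth=5.0)]
dets = [Box(1, 0, 11, 20, depth=5.1), Box(1, 0, 11, 20, depth=2.1)]
matches = associate(tracks, dets)
print(matches)
```

In this toy case both detections have the same IoU with each track, so position alone cannot separate them; the depth term is what pairs each track with the detection at a similar depth.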


Summary

INTRODUCTION

The goal of multiple-object tracking (MOT) is to consistently localize and identify several objects in a video sequence. The video sequences in existing MOT benchmark datasets are mostly captured in outdoor scenes, and the objects of interest are usually pedestrians who are dressed differently and move in a single direction. These datasets do not genuinely reveal the tracking challenges of indoor scenarios depicted in Figure 2. In other datasets, the video sequences are primarily traffic flows mixed with vehicles, cyclists, and pedestrians. In contrast to these datasets, which focus on pedestrian and vehicle tracking in outdoor environments, the proposed dataset is acquired in indoor environments and simulates choreographed and stage performance scenes, presenting frequent occlusions and objects with similar appearances.

DATASET STATISTICS

To cover a variety of indoor tracking scenarios, we define four recording conditions for the dataset: the number of people in the video, the similarity of the objects’ appearances, the movement patterns of objects in the space, and the diversity of poses. The proposed dataset consistently retains the identity information of each object, so it faithfully reflects trackers’ abilities to preserve object identities.
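Identity preservation is often summarized by counting identity switches. The sketch below is a simplified illustration, not the evaluation protocol used in the paper: it assumes the ground-truth objects have already been matched to predicted tracks in each frame, and it counts a switch whenever a ground-truth object is assigned a different track ID than in its most recent matched frame. The function name `count_identity_switches` and the input format are assumptions.

```python
def count_identity_switches(frames):
    """Count identity switches over a sequence.

    `frames` is a list of dicts, one per frame, mapping a ground-truth
    object ID to the predicted track ID it was matched to in that frame.
    Objects that are unmatched in a frame (e.g. occluded) are simply absent.
    """
    last_track = {}  # ground-truth ID -> most recent predicted track ID
    switches = 0
    for frame in frames:
        for gt_id, track_id in frame.items():
            if gt_id in last_track and last_track[gt_id] != track_id:
                switches += 1
            last_track[gt_id] = track_id
    return switches


# Dancer "B" is occluded in the second frame and reappears with a new track ID.
frames = [
    {"A": 1, "B": 2},
    {"A": 1},
    {"A": 1, "B": 3},
]
switches = count_identity_switches(frames)
print(switches)
```

Because a switch is only detectable when the ground truth keeps the same ID for an object across the whole sequence, including through occlusions, consistent identity annotation is what makes this kind of count meaningful.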

PROPOSED METHOD
MOT BENCHMARK EVALUATION
LIMITATIONS AND FUTURE
Findings
CONCLUSION