Tracking multiple people in crowds is a fundamental task in the multimedia field. It is often hindered by dynamic occlusion between objects, cluttered backgrounds, and abrupt illumination changes. To address these difficulties, in this paper we combine deep learning and depth information to build a stereo tracking system for crowds. The core of the system is the fusion of the advantages of deep learning and depth information, which is exploited to achieve object segmentation and to improve multiobject tracking performance under severe occlusion. More specifically, first, to obtain more accurate detection observations for the tracking system, we present a novel object-level segmentation method that combines the detection results of deep learning with depth information to obtain precise object segmentation results. Then, we integrate the segmentation results with three-dimensional (3-D) information to extract 2-D and 3-D characteristics that represent each target, and we design three similarity models to realize a stereo tracking method for crowds through data association. Finally, we build a diverse stereo dataset that includes various challenging indoor and outdoor scenes. Comprehensive experiments verify that the system tracks effectively and robustly in various scenarios and that it provides rich output, including segmentation results, target distance, and tracking results. Moreover, qualitative and quantitative comparisons show that the proposed algorithm not only achieves good object segmentation performance but also improves the tracking of completely and partially occluded objects, outperforming the tested state-of-the-art tracking approaches.