Abstract

This thesis addresses the following challenging problems of detecting and tracking humans in the presence of occlusions in typical surveillance videos: (1) adapting semantic-part-based human detectors to a new surveillance video sequence when detectors trained on other video data do not perform well on the new data; (2) tracking humans with person identification while minimizing identification errors over long tracking periods; and (3) hierarchical spatial and temporal analysis for discriminative tracking of human targets. The thesis aims to improve on the state of the art in human detection and tracking by studying human detectors and the extended tracking of track segments (tracklets) generated by short-term tracking of detection responses.

For the adaptation of semantic-part-based human detectors to a new surveillance video sequence, a unified deep CNN model is developed that jointly learns features, semantic pedestrian part detectors, and a transfer learning model. The components of this deep CNN model interact with each other during learning, which facilitates their optimization through co-operative learning. In particular, an adaptation layer is proposed to embed the capability of knowledge transfer into the CNN model. As a result, the proposed transferred CNN (T-CNN) model is able to transfer visual knowledge of the semantic pedestrian parts from the source data to the target data. Extensive experimental evaluations show that the proposed method outperforms other deep-learning-based methods in detection performance. Moreover, the adaptive deep features are complementary to the pre-defined features used by other state-of-the-art methods.
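The transfer idea above can be illustrated with a minimal sketch: a feature extractor trained on the source domain is kept frozen, and only a small adaptation layer is fitted on target-domain samples. Everything here (the fixed feature map, the toy data, the logistic adaptation layer) is an illustrative assumption, not the thesis model itself.

```python
import math

def source_features(x):
    # Stand-in for CNN features learned on the source video (kept frozen).
    # The last constant term acts as a bias feature.
    return [x[0] + x[1], x[0] - x[1], 1.0]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_adaptation_layer(samples, labels, lr=0.5, epochs=200):
    # Only this layer is updated on the target video: it re-weights the
    # frozen source features for the new domain (logistic regression on top).
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            f = source_features(x)
            p = sigmoid(sum(wi * fi for wi, fi in zip(w, f)))
            g = p - y  # gradient of the cross-entropy loss w.r.t. the logit
            w = [wi - lr * g * fi for wi, fi in zip(w, f)]
    return w

def predict(w, x):
    f = source_features(x)
    return 1 if sigmoid(sum(wi * fi for wi, fi in zip(w, f))) > 0.5 else 0

# Toy target-domain data: class 1 when x0 + x1 is large.
target_x = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9)]
target_y = [0, 0, 1, 1]

w = train_adaptation_layer(target_x, target_y)
preds = [predict(w, x) for x in target_x]
print(preds)  # the toy data is separable, so the layer fits it exactly
```

The design point being illustrated: keeping the source feature extractor fixed while training only the adaptation layer lets a small amount of target data steer the model without forgetting the source knowledge.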
For tracking humans with person identification while minimizing identification errors over long tracking periods, a novel method for tracklet association by network flow optimization is developed, based on online target-specific metric learning and coherent dynamics estimation. The proposed framework exploits appearance and motion cues to prevent identity switches during tracking and to recover missed detections. The target-specific metrics (appearance cue) and motion dynamics (motion cue) are learned and estimated online, i.e. during the tracking process. Furthermore, a learning algorithm is proposed that learns the weights of the motion and appearance cues in the tracklet affinity models, in order to handle difficult situations. Extensive evaluations following state-of-the-art practices show that the proposed method improves on existing state-of-the-art methods.

For hierarchical spatial and temporal analysis for discriminative tracking of human targets, inspired by recent advances in convolutional neural network (CNN) architectures, a novel unified deep model for tracklet association is developed, which jointly learns the CNNs and temporally constrained metrics. Furthermore, a novel loss function incorporating…
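The tracklet-association step can be sketched as a min-cost assignment combining a weighted appearance cost and a weighted motion cost. The thesis solves this by network flow optimization with learned target-specific metrics and cue weights; the brute-force assignment, Euclidean distances, and toy tracklets below are simplifying assumptions for illustration only.

```python
import itertools

def appearance_cost(a, b):
    # Distance between mean appearance descriptors. The thesis learns
    # target-specific metrics online; plain Euclidean distance stands in here.
    return sum((ai - bi) ** 2 for ai, bi in zip(a["app"], b["app"])) ** 0.5

def motion_cost(a, b):
    # Penalize disagreement between tracklet a's extrapolated position
    # (constant-velocity assumption) and tracklet b's starting position.
    dt = b["t0"] - a["t1"]
    pred = (a["p1"][0] + a["v"][0] * dt, a["p1"][1] + a["v"][1] * dt)
    return ((pred[0] - b["p0"][0]) ** 2 + (pred[1] - b["p0"][1]) ** 2) ** 0.5

def associate(ends, starts, w_app=1.0, w_mot=1.0):
    # Brute-force min-cost one-to-one assignment; the thesis formulates the
    # equivalent problem as a network flow and learns the cue weights.
    best, best_cost = None, float("inf")
    for perm in itertools.permutations(range(len(starts))):
        cost = sum(w_app * appearance_cost(ends[i], starts[j])
                   + w_mot * motion_cost(ends[i], starts[j])
                   for i, j in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return best

# Two tracklets ending at frame 10, two candidates starting at frame 13.
ends = [
    {"app": [0.9, 0.1], "p1": (0.0, 0.0), "v": (1.0, 0.0), "t1": 10},
    {"app": [0.1, 0.9], "p1": (5.0, 5.0), "v": (0.0, 1.0), "t1": 10},
]
starts = [
    {"app": [0.12, 0.88], "p0": (5.0, 8.0), "t0": 13},
    {"app": [0.88, 0.12], "p0": (3.0, 0.0), "t0": 13},
]
print(associate(ends, starts))  # end tracklet 0 -> start 1, end 1 -> start 0
```

Combining both cues is what prevents identity switches in this toy setup: either cue alone could be ambiguous, but their weighted sum makes the crossed assignment clearly cheaper.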
