Abnormal event detection in large videos is an important task in research and industrial applications, and it has attracted considerable attention in recent years. Existing methods usually solve this problem by extracting local features and then learning an outlier detection model on training videos. However, most previous approaches rely solely on hand-crafted visual features, whose limited representation capacity is a clear disadvantage. In this paper, we present a novel unsupervised deep feature learning algorithm for the abnormal event detection problem. To exploit the spatiotemporal information of the inputs, we use a deep three-dimensional convolutional network (C3D) to perform feature extraction. The key problem is then how to train the C3D network without any category labels. Here, we employ the sparse coding results of the hand-crafted features generated from the inputs to guide the unsupervised feature learning. Specifically, we define a multilevel similarity relationship between these inputs according to the statistical information of the atoms they share. We then introduce the quadruplet concept to model this multilevel similarity structure, which is used to construct a generalized triplet loss for training the C3D network. Furthermore, the trained C3D network can in turn generate the features for sparse coding, and this pipeline can be iterated several times. By jointly optimizing the sparse coding and the unsupervised feature learning, we obtain robust and rich feature representations. Based on the learned representations, the sparse reconstruction error is used to predict the anomaly score of each testing input. Experiments on several publicly available video surveillance datasets, in comparison with a number of existing works, demonstrate that the proposed algorithm performs favorably against state-of-the-art methods.
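As a rough illustration of the final scoring step only, the sketch below computes a sparse reconstruction error for a feature vector against a dictionary. It is a minimal example under stated assumptions, not the authors' implementation: the abstract does not say which solver, penalty weight, or dictionary is used, so the ISTA solver, the parameter `lam`, and the function names `sparse_code` and `anomaly_score` are all hypothetical choices made here for illustration.

```python
import numpy as np


def sparse_code(x, D, lam=0.1, n_iter=200):
    """Approximately solve min_a 0.5 * ||x - D a||^2 + lam * ||a||_1 with ISTA.

    ISTA is an assumed solver; the abstract does not name one.
    x: feature vector of shape (d,); D: dictionary of shape (d, k), atoms as columns.
    """
    # Lipschitz constant of the gradient of the quadratic term
    # (squared spectral norm of D).
    lipschitz = np.linalg.norm(D, 2) ** 2
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)
        z = a - grad / lipschitz
        # Soft-thresholding: proximal operator of the L1 penalty.
        a = np.sign(z) * np.maximum(np.abs(z) - lam / lipschitz, 0.0)
    return a


def anomaly_score(x, D, lam=0.1):
    """Squared reconstruction error of x under its sparse code (higher = more abnormal)."""
    a = sparse_code(x, D, lam)
    residual = x - D @ a
    return float(residual @ residual)
```

Under this sketch, an input that lies close to the span of a few dictionary atoms receives a low score, while an input that the atoms cannot reconstruct receives a high one. In the pipeline described above, `x` would be a learned C3D feature and `D` a dictionary fitted on features of normal training videos.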