FlexAE: A Self-Conditioned Detector To Prevent Model Overfitting For Unsupervised Video Anomaly Detection
Unsupervised Video Anomaly Detection (VAD) has garnered significant attention for its ability to exploit unlabeled videos. However, VAD faces two primary challenges arising from the absence of labels: (i) striking a balance between overfitting and underfitting, and (ii) optimal parameter tuning. To tackle these challenges, we propose a novel detector named Flexible AutoEncoder (FlexAE). A fitting-parameter is introduced to regulate the model’s fitting capacity, and a novel Negative Learning (NL) mechanism is integrated to mitigate the influence of anomalies during training. For self-conditioning, a novel algorithm is devised to autonomously update the fitting-parameter and the threshold used in NL based on the reconstruction error. Comprehensive experiments on two benchmark datasets, UCF-Crime and ShanghaiTech, demonstrate that our proposed FlexAE outperforms state-of-the-art methods without the need for manual hyperparameter tuning.
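The core mechanism the abstract describes — scoring frames by reconstruction error and self-updating a threshold from that error distribution — can be sketched as follows. This is an illustrative NumPy sketch under assumed names (`anomaly_scores`, `update_threshold`, the `mean + k·std` rule), not FlexAE's released implementation:

```python
import numpy as np

def anomaly_scores(frames, reconstructions):
    """Per-frame anomaly score: mean squared reconstruction error."""
    err = ((frames - reconstructions) ** 2).reshape(len(frames), -1)
    return err.mean(axis=1)

def update_threshold(errors, k=2.0):
    """Self-updated threshold from the current error distribution
    (mean + k * std, an assumed rule); frames above it would be
    down-weighted by a negative-learning style loss during training."""
    return errors.mean() + k * errors.std()

rng = np.random.default_rng(0)
frames = rng.normal(size=(8, 4, 4))
recon = frames + rng.normal(scale=0.1, size=frames.shape)
recon[0] += 1.0  # simulate one badly reconstructed (anomalous) frame

scores = anomaly_scores(frames, recon)
tau = update_threshold(scores)
flags = scores > tau  # only the corrupted frame exceeds the threshold
```

Because the threshold is derived from the errors themselves, no hyperparameter needs to be picked per dataset, which is the self-conditioning idea the abstract emphasizes.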
- Conference Article
- 17
- 10.1109/ijcnn.2019.8852022
- Jul 1, 2019
Video anomaly detection is a valuable but challenging task, especially in the field of surveillance videos for public safety. Almost all existing methods tackle the problem under the supervised setting, and only a few attempts address unsupervised learning. To avoid the cost of labeling training videos, this paper proposes to discriminate anomalies with a novel two-stage framework in a fully unsupervised manner. Unlike previous unsupervised approaches using local change detection to discover abnormality, our method enjoys the global information from video context by considering the pair-wise similarity of all video events. In this way, our method formulates video anomaly detection as an extension of unsupervised one-class learning, which has not been explored in the literature of video anomaly detection. Specifically, our method consists of two stages: the first stage, a kernel-based method named Low-rank based Unsupervised One-class Learning with Ridge Regression (LR-UOCL-RR), reformulates the optimization goal of UOCL with ridge regression to avoid expensive computation, which enables our method to handle massive unlabeled data from videos. In the second stage, the estimated normal video events from the first stage are fed into a one-class support vector machine to refine the profile around normal events and enhance performance. The experimental results conducted on two challenging video benchmarks indicate that our method is considerably superior to state-of-the-art methods in the unsupervised anomaly detection task, with up to a 15.7% AUC gain, and even better than several supervised approaches.
- Conference Article
- 33
- 10.1109/wacv56688.2023.00266
- Jan 1, 2023
Anomaly detection in video surveillance aims to detect anomalous frames whose properties significantly differ from normal patterns. Anomalies in videos can occur in both spatial appearance and temporal motion, making unsupervised video anomaly detection challenging. To tackle this problem, we investigate forward and backward motion continuity between adjacent frames and propose a new video anomaly detection paradigm based on bi-directional frame interpolation. The proposed framework consists of an optical flow estimation network and an interpolation network jointly optimized end-to-end to synthesize a middle frame from its nearest two frames. We further introduce a novel dynamic memory mechanism to balance memory sparsity and normality representation diversity, which attenuates abnormal features in frame interpolation without affecting normal prototypes. In inference, interpolation error and dynamic memory error are fused as anomaly scores. The proposed bi-directional interpolation design improves normal frame synthesis, lowering the false alarm rate of anomaly appearance; meanwhile, the implicit "regular" motion constraint in our optical flow estimation and the novel dynamic memory mechanism play blocking roles in interpolating abnormal frames, increasing the system’s sensitivity to anomalies. Extensive experiments on public benchmarks demonstrate the superiority of the proposed framework over prior art.
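At inference this paper fuses interpolation error and dynamic memory error into a single anomaly score. A minimal sketch of such score-stream fusion, with hypothetical names and an assumed equal weighting rather than the authors' code, might look like:

```python
import numpy as np

def min_max(x):
    """Normalise an error stream to [0, 1] so two streams are comparable."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min() + 1e-8)

def fused_score(interp_err, mem_err, w=0.5):
    """Weighted fusion of per-frame interpolation and memory errors."""
    return w * min_max(interp_err) + (1.0 - w) * min_max(mem_err)

interp_err = [0.02, 0.03, 0.40, 0.02]  # middle-frame synthesis error
mem_err = [0.10, 0.12, 0.90, 0.11]     # distance to nearest memory prototype
scores = fused_score(interp_err, mem_err)  # frame 2 stands out in both streams
```

Normalising each stream before summing prevents whichever error happens to have the larger raw magnitude from dominating the fused score.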
- Conference Article
- 10.24963/ijcai.2024/838
- Aug 1, 2024
Sustainable cities require high-quality community management and surveillance analytics, which are supported by video anomaly detection techniques. However, mainstream video anomaly detection techniques still require manually labeled data and do not scale to real-world massive videos. Without labels, unsupervised video anomaly detection (UVAD) is challenged by pseudo-label noise and the openness of anomaly detection. In response, a diffusion-based latent pattern learning UVAD framework is proposed, called DiffVAD. The method learns latent patterns by generating different patterns of the same event through diffusion models. Anomalies are detected by evaluating the pattern distribution: the different patterns of normal events are diverse but correlated, while the patterns of abnormal events are more diffuse. This manner of detection is equally effective for normal events unseen in the training set. In addition, we design a refinement strategy for pseudo-labels to mitigate the noise problem. Extensive experiments on six benchmark datasets demonstrate the design’s promising generalization ability and high efficiency. Specifically, DiffVAD obtains an AUC score of 81.9% on the ShanghaiTech dataset.
- Research Article
- 1
- 10.54097/hset.v12i.1444
- Aug 26, 2022
- Highlights in Science, Engineering and Technology
As surveillance technology is continuously improving, an ever-increasing number of cameras are being deployed everywhere. Relying on manual detection of anomalies through cameras may be unreliable and untimely. Therefore, the application of deep learning in video anomaly detection is being extensively studied. Anomaly Detection (AD) refers to identifying events that deviate from the desired actions. This article discusses representative unsupervised and weakly-supervised learning methods applied to various data types. In these machine learning methods, Generative Adversarial Network, Auto Encoder, Recurrent Neural Network, etc. are broadly adopted for AD. Some renowned and new datasets are reviewed. Furthermore, we also propose several future directions of research in video anomaly detection.
- Research Article
- 189
- 10.1109/tnnls.2021.3083152
- Jun 1, 2022
- IEEE Transactions on Neural Networks and Learning Systems
Video anomaly detection is commonly used in many applications, such as security surveillance, and is very challenging. A majority of recent video anomaly detection approaches utilize deep reconstruction models, but their performance is often suboptimal because of insufficient reconstruction error differences between normal and abnormal video frames in practice. Meanwhile, frame prediction-based anomaly detection methods have shown promising performance. In this article, we propose a novel and robust unsupervised video anomaly detection method by frame prediction with a design more in line with the characteristics of surveillance videos. The proposed method is equipped with a multipath ConvGRU-based frame prediction network that can better handle semantically informative objects and areas of different scales and capture spatial-temporal dependencies in normal videos. A noise tolerance loss is introduced during training to mitigate the interference caused by background noise. Extensive experiments have been conducted on the CUHK Avenue, ShanghaiTech Campus, and UCSD Pedestrian datasets, and the results show that our proposed method outperforms existing state-of-the-art approaches. Remarkably, our proposed method obtains a frame-level AUROC score of 88.3% on the CUHK Avenue dataset.
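Frame-prediction methods such as this one conventionally convert prediction quality into an anomaly score via PSNR, a common practice in this literature (the paper's exact scoring function may differ). A minimal sketch:

```python
import numpy as np

def psnr(pred, target, peak=1.0):
    """Peak signal-to-noise ratio between a predicted and an actual frame."""
    mse = np.mean((pred - target) ** 2)
    return 10.0 * np.log10(peak ** 2 / (mse + 1e-12))

def psnr_anomaly_scores(psnrs):
    """Invert and min-max normalise PSNR so 1 marks the most anomalous frame:
    well-predicted (high-PSNR) frames are presumed normal."""
    p = np.asarray(psnrs, dtype=float)
    return 1.0 - (p - p.min()) / (p.max() - p.min() + 1e-8)

target = np.zeros((4, 4))
preds = [target + 0.05, target + 0.05, target + 0.5]  # last frame badly predicted
scores = psnr_anomaly_scores([psnr(p, target) for p in preds])
```

The log scale of PSNR compresses the huge dynamic range of raw squared errors, which makes scores comparable across frames with very different content.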
- Conference Article
- 1
- 10.1109/acait53529.2021.9731141
- Oct 29, 2021
Video anomaly detection (VAD) is commonly formulated as the discrimination of events that do not conform to the regular patterns in videos. Recently, deep neural network-based VAD approaches have made remarkable progress. Existing unsupervised approaches usually achieve VAD by frame reconstruction or prediction, and then identify anomalies according to the reconstruction or prediction errors. However, these approaches suffer from two limitations: (1) they cannot obtain the semantic features of normal training samples, and (2) they are suboptimal because of the misalignment between the proxy task and the actual task. To address these issues, we present a novel temporal-aware self-supervised learning framework that obtains high-level semantic features and performs VAD by solving multiple pretext tasks. In particular, we utilize temporal transformations to form multiple pretext tasks (transformation prediction) for VAD. A 3D encoder is trained to obtain semantic features by jointly solving these pretext tasks. Then, multi-task heads utilize these features to solve the different pretext tasks. In the inference phase, the multiple task losses are used to calculate the final anomaly score. Extensive experiments are conducted on two benchmarks, showing that the proposed method outperforms the state of the art.
- Research Article
- 193
- 10.1007/s11263-022-01578-9
- Feb 22, 2022
- International Journal of Computer Vision
The unsupervised detection and localization of anomalies in natural images is an intriguing and challenging problem. Anomalies manifest themselves in very different ways and an ideal benchmark dataset for this task should contain representative examples for all of them. We find that existing datasets are biased towards local structural anomalies such as scratches, dents, or contaminations. In particular, they lack anomalies in the form of violations of logical constraints, e.g., permissible objects occurring in invalid locations. We contribute a new dataset based on industrial inspection scenarios that evenly covers both types of anomalies. We provide pixel-precise ground truth data for each anomalous region and define a generalized evaluation metric that addresses localization ambiguities that can arise for logical anomalies. Furthermore, we propose a novel algorithm that improves over the state of the art in the joint detection of structural and logical anomalies. It consists of a local and a global network branch. The first one inspects confined regions independent of their spatial locations in the input image and is primarily responsible for the detection of entirely new local structures. The second one learns a globally consistent representation of the training data through a bottleneck that enables the detection of violations of long-range dependencies, a key characteristic of many logical anomalies. We perform extensive evaluations on our new dataset to corroborate our claims.
- Conference Article
- 185
- 10.1109/cvpr52688.2022.01433
- Jun 1, 2022
Video anomaly detection is well investigated in weakly-supervised and one-class classification (OCC) settings. However, unsupervised video anomaly detection methods are quite sparse, likely because anomalies are less frequent in occurrence and usually not well-defined, which when coupled with the absence of ground truth supervision, could adversely affect the performance of the learning algorithms. This problem is challenging yet rewarding as it can completely eradicate the costs of obtaining laborious annotations and enable such systems to be deployed without human intervention. To this end, we propose a novel unsupervised Generative Cooperative Learning (GCL) approach for video anomaly detection that exploits the low frequency of anomalies towards building a cross-supervision between a generator and a discriminator. In essence, both networks get trained in a cooperative fashion, thereby allowing unsupervised learning. We conduct extensive experiments on two large-scale video anomaly detection datasets, UCF-Crime and ShanghaiTech. Consistent improvements over the existing state-of-the-art unsupervised and OCC methods corroborate the effectiveness of our approach.
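The cross-supervision idea above — exploiting the rarity of anomalies so one network can label data for the other — can be illustrated with a simple pseudo-labelling step. This is a hypothetical sketch of that step only, not the GCL training loop:

```python
import numpy as np

def pseudo_labels(recon_errors, anomaly_ratio=0.2):
    """Mark the highest-error fraction of samples as pseudo-anomalous (1)
    and the rest as pseudo-normal (0): anomalies are assumed to be rare
    and poorly reconstructed by a generator trained on mostly-normal data.
    `anomaly_ratio` is an assumed knob, not a value from the paper."""
    err = np.asarray(recon_errors, dtype=float)
    k = max(1, int(round(anomaly_ratio * len(err))))
    thresh = np.sort(err)[-k]
    return (err >= thresh).astype(int)

# One batch of generator reconstruction errors; the third clip stands out.
labels = pseudo_labels([0.10, 0.12, 0.90, 0.11, 0.13], anomaly_ratio=0.2)
```

A discriminator trained on such labels can, in turn, re-score the data for the generator, which is the cooperative loop the abstract describes.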
- Book Chapter
- 10.3233/faia230417
- Sep 28, 2023
- Frontiers in artificial intelligence and applications
Unsupervised Video Anomaly Detection (UVAD) utilizes completely unlabeled videos for training without any human intervention. Due to the existence of unlabeled abnormal videos in the training data, the performance of UVAD has a large gap compared with semi-supervised VAD, which only uses normal videos for training. To address the insufficient ability of existing UVAD methods to learn normality and to reduce the negative impact of abnormal events, this paper proposes a novel Enhanced Spatio-temporal Self-selective Learning (ESSL) framework for UVAD. This framework is designed to capture both appearance and motion features through effective network structures by solving spatial and temporal jigsaw puzzles. Specifically, we develop a Self-selective Learning Module (SLM) for UVAD, which prevents the model from learning abnormal features and enhances the model by selecting normal features. Experimental results on three benchmark datasets show that the proposed method not only surpasses state-of-the-art UVAD works, but also achieves performance comparable to classic semi-supervised video anomaly detection methods that need manually selected normal videos. Code is available at: https://github.com/xusuger/ESSL.
- Research Article
- 1
- 10.1142/s021800142451011x
- Jun 29, 2024
- International Journal of Pattern Recognition and Artificial Intelligence
Video anomaly detection has always been a challenging task in computer vision due to data imbalance and susceptibility to scene variations such as lighting and occlusions. In response to this challenge, this paper proposes an unsupervised video anomaly detection method based on an attention-enhanced memory network. The method utilizes a dual-stream autoencoder network structure, enhancing the model’s ability to learn important appearance and motion features by introducing coordinate attention and variance attention mechanisms that emphasize significant characteristics of static objects and rapidly moving regions. By adding memory modules to both the appearance and motion branches, the network’s memory information is reinforced, enabling it to capture long-term spatiotemporal dependencies in videos and thereby improving the accuracy of anomaly detection. Furthermore, optimizing the network’s activation functions to handle negative inputs enhances its nonlinear modeling capability, enabling better adaptation to complex environments, including variations in lighting and occlusions, and further improving the effectiveness of anomaly detection. The paper conducts comparative experiments and ablation studies using three publicly available datasets and various models. The results demonstrate that compared to baseline models, the AUC performance is improved by 3.9%, 4.7%, and 1.7% on the UCSD Ped2, CUHK Avenue, and ShanghaiTech datasets, respectively. When compared with the other models, the average AUC performance is improved by 4.3%, 5.4%, and 6.2%, with an average improvement of 8.75% in the EER metric, validating the effectiveness and adaptability of the proposed method. The code can be obtained at the following URL: https://github.com/AcademicWhite/AEMNet.
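The AUC figures reported throughout these entries are frame-level ROC AUC scores. For reference, the metric can be computed from per-frame anomaly scores and binary labels via the rank-based (Mann-Whitney U) formulation; the sketch below assumes no tied scores:

```python
import numpy as np

def frame_auc(scores, labels):
    """Frame-level ROC AUC: the probability that a randomly chosen anomalous
    frame scores higher than a randomly chosen normal one (no tie handling)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = scores.argsort()
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)  # ascending ranks, 1-based
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    # Mann-Whitney U statistic from the rank sum of the anomalous frames.
    u = ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)
```

Because AUC depends only on the ranking of scores, it is insensitive to the choice of detection threshold, which is why it is the standard comparison metric across these papers.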
- Research Article
- 36
- 10.1109/access.2023.3237028
- Jan 1, 2023
- IEEE Access
Anomaly detection in video is an advanced computer vision challenge that recognizes video segments containing out-of-the-ordinary motions or objects. Most recent techniques in video anomaly detection have focused on reconstruction and prediction methods; however, in practice, frame reconstruction methods deliver suboptimal results due to the outstanding generalization abilities of convolutional neural networks when reconstructing abnormal frames. Meanwhile, frame prediction methods have drawn much attention and are a powerful way of simulating the dynamics of natural scenes. This paper provides a new unsupervised frame prediction-based algorithm for anomaly detection that improves overall performance. Our suggested strategy follows a U-Net-like architecture that employs a Time-distributed 2D CNN-based encoder and a 2D CNN-based decoder. A memory module is used in the design to retrieve and store the most relevant prototypical patterns of the normal scenario in the memory slots during training, giving our model the capacity to produce poor predictions in the case of unusual input. For the memory module to fully retain normal semantic patterns at multiple scales, we propose an upstream multi-branch structure composed of dilated convolutions to extract contextual information. We also provide a multi-path structure that, as a strong substitute for the optical flow loss function, directly incorporates temporal information into the network design. Experiments on the UCSD Ped1, UCSD Ped2, and CUHK Avenue benchmark datasets revealed that our design outperforms most competing models.
- Research Article
- 293
- 10.1016/j.imavis.2020.104078
- Nov 30, 2020
- Image and Vision Computing
A comprehensive review on deep learning-based methods for video anomaly detection
- Research Article
- 11
- 10.3390/s23042087
- Feb 13, 2023
- Sensors (Basel, Switzerland)
The interest in video anomaly detection systems that can detect different types of anomalies, such as violent behaviours in surveillance videos, has gained traction in recent years. The current approaches employ deep learning to perform anomaly detection in videos, but this approach has multiple problems. For example, deep learning in general has issues with noise, concept drift, explainability, and training data volumes. Additionally, anomaly detection in itself is a complex task and faces challenges such as unknownness, heterogeneity, and class imbalance. Anomaly detection using deep learning is therefore mainly constrained to generative models such as generative adversarial networks and autoencoders due to their unsupervised nature; however, even they suffer from general deep learning issues and are hard to properly train. In this paper, we explore the capabilities of the Hierarchical Temporal Memory (HTM) algorithm to perform anomaly detection in videos, as it has favorable properties such as noise tolerance and online learning which combats concept drift. We introduce a novel version of HTM, named GridHTM, which is a grid-based HTM architecture specifically for anomaly detection in complex videos such as surveillance footage. We have tested GridHTM using the VIRAT video surveillance dataset, and the subsequent evaluation results and online learning capabilities prove the great potential of using our system for real-time unsupervised anomaly detection in complex videos.
- Research Article
- 10.3390/s25185869
- Sep 19, 2025
- Sensors (Basel, Switzerland)
Video anomaly detection in unconstrained environments remains a fundamental challenge due to the scarcity of labeled anomalous data and the diversity of real-world scenarios. To address this, we propose a novel unsupervised framework that integrates RGB appearance and optical flow motion via a unified GAN-based architecture. The generator features a dual encoder and a GRU–attention temporal bottleneck, while the discriminator employs ConvLSTM layers and residual-enhanced MLPs to evaluate temporal coherence. To improve training stability and reconstruction quality, we introduce DASLoss—a composite loss that incorporates pixel, perceptual, temporal, and feature consistency terms. Experiments were conducted on three benchmark datasets. On XD-Violence, our model achieves an Average Precision (AP) of 80.5%, outperforming other unsupervised methods such as MGAFlow and Flashback. On Hockey Fight, it achieves an AUC of 0.92 and an F1-score of 0.85, demonstrating strong performance in detecting short-duration violent events. On UCSD Ped2, our model attains an AUC of 0.96, matching several state-of-the-art models despite using no supervision. These results confirm the effectiveness and generalizability of our approach in diverse anomaly detection settings.
- Conference Article
- 4
- 10.1109/icassp49357.2023.10094566
- Jun 4, 2023
Video anomaly detection aims to automatically detect abnormal objects or behaviors. Most existing methods tackle the problem by minimizing reconstruction errors, stemming from the lack of anomalous data, which leads to poor interpretability and robustness. Focusing on the context-dependent nature of anomaly detection, we propose a robust unsupervised Video Anomaly Detection framework based on Knowledge and Frame Prediction, called VAD-KFP. Prior knowledge that contains the context of anomalies is introduced into the multi-path frame prediction network through multi-layer Graph Convolutional Networks. By integrating this prior knowledge to accurately define anomalies, VAD-KFP is robust to different scenarios and is able to recognize the type of anomaly. An extensive range of experiments has been conducted on three benchmarks, the results of which indicate that our method outperforms strong baselines. Specifically, VAD-KFP obtains an AUROC score of 91.6% on the Avenue dataset.