Video anomaly detection (VAD) has been studied intensively for years because of its potential applications in intelligent video systems. Prediction-based VAD methods, which learn normality through a frame prediction task, have attracted considerable attention in recent years due to their simplicity and effectiveness. However, we find that they often consider only local or global normality in the temporal dimension and are therefore difficult to generalize across diverse scenarios. Specifically, some methods focus on learning local spatiotemporal representations from consecutive frames to enhance the representation of normal events, while others are devoted to memorizing prototypical normal patterns over whole training videos to weaken the model's ability to generalize to anomalies. In this paper, we rethink these two types of methods and find that the former struggle in simple scenarios, because their powerful representation capacity also allows them to represent some anomalies and thus causes missed detections, while the latter struggle in complex scenarios, because their limited capacity for diverse normal patterns causes false alarms. To verify these findings, we design a two-branch model, Local–Global Normality Network (LGN-Net), that learns local and global normality simultaneously. An ablation study on the effects of local and global normality shows that the model adapts better to different scenarios when both types of normality are taken into account. The code is available online: https://github.com/Myzhao1999/LGN-Net.
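To make the two-branch idea concrete, the sketch below combines the two normality cues the abstract describes: a local cue (prediction error between a predicted and an actual frame) and a global cue (distance from a frame's feature to its nearest memorized normal prototype). This is a minimal NumPy illustration under assumed inputs (`memory`, `alpha`, and the feature vectors are hypothetical), not the paper's actual LGN-Net architecture or fusion rule.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical memory bank of prototypical normal feature vectors
# (stands in for the global branch's memorized patterns).
memory = rng.normal(size=(10, 8))  # 10 prototypes, 8-dim features


def local_error(predicted, actual):
    """Local-normality cue: prediction error of a frame-prediction branch (MSE)."""
    return float(np.mean((np.asarray(predicted) - np.asarray(actual)) ** 2))


def global_distance(feature, memory):
    """Global-normality cue: distance to the nearest memorized normal prototype."""
    return float(np.min(np.linalg.norm(memory - np.asarray(feature), axis=1)))


def anomaly_score(predicted, actual, feature, memory, alpha=0.5):
    """Fuse both cues into one score; alpha is an assumed weighting,
    not the paper's exact fusion rule. Higher score = more anomalous."""
    return alpha * local_error(predicted, actual) + \
        (1 - alpha) * global_distance(feature, memory)
```

A normal frame that is predicted well *and* lies near a memorized prototype scores low on both cues; an anomaly tends to raise at least one of them, which is the intuition behind accounting for both types of normality.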