Video Anomaly Detection (VAD) aims to automatically identify unexpected spatial–temporal patterns in order to detect abnormal events in surveillance videos. Existing unsupervised VAD methods either use a single proxy task to represent prototypical normal patterns, or split the video into static and dynamic parts and learn spatial and temporal normality separately; neither approach models the inherent uncertainty of motion. To this end, we propose, for the first time, to learn video normality via a stochastic motion representation. Specifically, the proposed Stochastic Video Normality (SVN) network learns prototypical local appearance patterns via deterministic multi-task learning and global motion patterns in a non-deterministic manner. The stochastic representation module uses recurrent networks to model historical motion as a conditional Gaussian distribution and infers motion explicitly from the learned prior. In addition, a masked auto-encoder is introduced to explore the relationship between appearance and motion, helping the model learn spatial–temporal normality. Experimental results show that the proposed SVN network performs comparably to state-of-the-art methods. Extensive analysis demonstrates the effectiveness of both the multi-task-learning-based appearance representation and the stochastic motion representation for unsupervised video anomaly detection.
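The core idea of the stochastic motion representation — a recurrent network summarizing past motion and emitting a conditional Gaussian over the next motion feature, whose likelihood then scores how "normal" observed motion is — can be sketched as follows. This is a minimal illustrative toy, not the SVN architecture: the dimensions, the plain tanh RNN cell, the randomly initialized weights, and the per-frame motion features are all hypothetical stand-ins for the paper's learned components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): motion-feature dim, hidden dim.
D, H = 8, 16

# Randomly initialized parameters standing in for a trained recurrent prior.
Wx = rng.normal(scale=0.1, size=(H, D))   # input-to-hidden
Wh = rng.normal(scale=0.1, size=(H, H))   # hidden-to-hidden
Wmu = rng.normal(scale=0.1, size=(D, H))  # hidden -> Gaussian mean
Wlv = rng.normal(scale=0.1, size=(D, H))  # hidden -> Gaussian log-variance

def step(h, x):
    """Update the hidden state from one motion feature (plain tanh RNN cell)."""
    return np.tanh(Wx @ x + Wh @ h)

def prior(h):
    """Conditional Gaussian p(x_t | x_<t): diagonal mean and log-variance."""
    return Wmu @ h, Wlv @ h

def gaussian_nll(x, mu, logvar):
    """Negative log-likelihood of x under the diagonal Gaussian prior."""
    return 0.5 * np.sum(logvar + (x - mu) ** 2 / np.exp(logvar) + np.log(2 * np.pi))

# Score a sequence: a high NLL at a step flags motion the prior finds unlikely,
# i.e. a candidate anomaly. `seq` is a stand-in for per-frame motion features.
seq = rng.normal(size=(20, D))
h = np.zeros(H)
scores = []
for x in seq:
    mu, logvar = prior(h)          # predict next motion from history
    scores.append(gaussian_nll(x, mu, logvar))
    h = step(h, x)                 # fold the observation into the history

print(len(scores), all(np.isfinite(s) for s in scores))
```

In a trained model the recurrent and output weights would be learned (e.g. by maximizing the likelihood of normal training motion), so that normal motion receives low NLL and unexpected motion receives high NLL; thresholding the per-frame scores then yields the anomaly decision.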