Abstract

Weakly supervised video anomaly detection is generally formulated as a multiple instance learning (MIL) problem, where an anomaly detector learns to generate frame-level anomaly scores under the supervision of MIL-based video-level classification. However, most previous works suffer from two drawbacks: 1) they lack the ability to model temporal relationships between video segments and 2) they cannot extract sufficiently discriminative features to separate normal and anomalous snippets. In this article, we develop a weakly supervised temporal discriminative (WSTD) paradigm that leverages both temporal relations and feature discrimination to mitigate the above drawbacks. To this end, we propose a transformer-styled temporal feature aggregator (TTFA) and a self-guided discriminative feature encoder (SDFE). Specifically, TTFA captures multiple types of temporal relationships between video snippets from different feature subspaces, while SDFE enhances the discriminative power of features by clustering normal snippets and maximizing the separability between anomalous snippets and normal centers in the embedding space. Experimental results on three public benchmarks show that WSTD outperforms state-of-the-art unsupervised and weakly supervised methods, verifying the superiority of the proposed method.
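The abstract describes two components only at a high level. As an illustrative sketch of the general ideas (not the authors' actual implementation, whose details are not given here), the snippet below shows (a) multi-head self-attention over snippet features, in the spirit of a transformer-styled temporal aggregator that relates snippets in different feature subspaces, and (b) a margin-based separability objective that pulls normal snippets toward their center and pushes anomalous snippets away from it. All function names, shapes, and the margin value are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_temporal_attention(X, num_heads=4, seed=0):
    """Sketch of transformer-style temporal aggregation over snippets.

    X: (T, D) array of T snippet features. Each head projects features
    into its own subspace (assumed dimension D // num_heads), attends
    over all snippets, and the head outputs are concatenated, with a
    residual connection back to the input.
    """
    T, D = X.shape
    assert D % num_heads == 0
    d = D // num_heads
    rng = np.random.default_rng(seed)  # random projections stand in for learned weights
    head_outputs = []
    for _ in range(num_heads):
        Wq = rng.standard_normal((D, d)) / np.sqrt(D)
        Wk = rng.standard_normal((D, d)) / np.sqrt(D)
        Wv = rng.standard_normal((D, d)) / np.sqrt(D)
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        A = softmax(Q @ K.T / np.sqrt(d))        # (T, T) temporal attention
        head_outputs.append(A @ V)               # (T, d) per-head aggregation
    return X + np.concatenate(head_outputs, axis=1)  # residual, shape (T, D)

def separability_loss(feats, labels, margin=1.0):
    """Sketch of a discriminative objective in embedding space.

    feats: (N, D) snippet embeddings; labels: 0 = normal, 1 = anomalous.
    Normal snippets are clustered toward their mean (the "normal center");
    anomalous snippets are pushed at least `margin` away from it.
    """
    center = feats[labels == 0].mean(axis=0)
    dist = np.linalg.norm(feats - center, axis=1)
    compactness = (dist[labels == 0] ** 2).mean()
    separation = (np.maximum(0.0, margin - dist[labels == 1]) ** 2).mean()
    return compactness + separation
```

In such a setup the attention module would feed the aggregated features to a scoring head trained with the MIL video-level objective, while the separability term supplies the additional discriminative supervision.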
