Weakly supervised video anomaly detection is the task of detecting anomalous frames in videos when no frame-level labels are available during training. Previous methods have usually employed a multiple instance learning (MIL)-based ranking loss to enforce inter-class separation. However, these methods are unable to fully utilise the information contained in the large number of normal frames, and their training is misguided by the erroneous initial predictions of the MIL-based classifier. To address these shortcomings, we propose a diffusion-based normality learning pre-training step, in which a Global–Local Feature Encoder (GLFE) is first trained on normal videos only to learn the feature distribution of normal frames. The pre-trained GLFE is then further optimised with a Multi-Sequence Contrastive loss on both normal and anomalous videos. The proposed GLFE captures long- and short-range temporal features using a Transformer block and a pyramid of dilated convolutions in a two-branch setup; a Co-Attention module adaptively learns the relation between the two branches' features and provides a learnable fusion. Additionally, we introduce a triplet contrastive loss to better separate abnormal from normal frames in anomalous videos. The proposed method is evaluated through extensive experiments on two public benchmark datasets, UCF-Crime and ShanghaiTech, and achieves results comparable to or better than existing state-of-the-art weakly supervised methods.
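The abstract refers to the MIL-based ranking loss it critiques and the triplet contrastive loss it proposes. Below is a minimal PyTorch sketch of generic forms of both, assuming a Sultani-style hinge on the top-scoring snippet of each bag and a cosine-distance triplet margin; the paper's exact formulations, margins, and feature choices may differ.

```python
# Minimal sketch (not the authors' code): a standard MIL ranking hinge loss and
# a generic triplet contrastive loss for separating normal and abnormal snippets.
import torch
import torch.nn.functional as F


def mil_ranking_loss(scores_abnormal: torch.Tensor,
                     scores_normal: torch.Tensor,
                     margin: float = 1.0) -> torch.Tensor:
    """Hinge loss on the top-scoring snippet of each bag (Sultani-style MIL).

    scores_abnormal, scores_normal: (batch, num_snippets) anomaly scores in [0, 1].
    """
    top_abnormal = scores_abnormal.max(dim=1).values  # highest score in the anomalous bag
    top_normal = scores_normal.max(dim=1).values      # highest score in the normal bag
    return F.relu(margin - top_abnormal + top_normal).mean()


def triplet_contrastive_loss(anchor: torch.Tensor,
                             positive: torch.Tensor,
                             negative: torch.Tensor,
                             margin: float = 0.5) -> torch.Tensor:
    """Generic triplet loss: pull the anchor toward positive (normal) features and
    push it away from negative (abnormal) features by at least `margin`."""
    d_pos = 1.0 - F.cosine_similarity(anchor, positive, dim=-1)
    d_neg = 1.0 - F.cosine_similarity(anchor, negative, dim=-1)
    return F.relu(d_pos - d_neg + margin).mean()


if __name__ == "__main__":
    # Toy example: 4 videos per class, 32 snippets each, 512-d features.
    scores_a, scores_n = torch.rand(4, 32), torch.rand(4, 32)
    print("MIL ranking loss:", mil_ranking_loss(scores_a, scores_n).item())

    anchor = torch.randn(4, 512)
    positive = anchor + 0.1 * torch.randn_like(anchor)
    negative = torch.randn(4, 512)
    print("Triplet loss:", triplet_contrastive_loss(anchor, positive, negative).item())
```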