Abstract

Weakly supervised anomalous behavior detection has recently attracted considerable attention. Compared with semi-supervised anomalous behavior detection, weakly supervised learning eliminates the need to crop videos and avoids the difficulty semi-supervised methods have in handling long videos. Previous work has used graph convolutions or self-attention mechanisms to model temporal relationships. However, these methods tend to model temporal relationships at a single scale and give little consideration to how different temporal relationships should be aggregated. In this paper, we propose a weakly supervised anomaly detection framework, MTDA-Net, which emphasizes modeling diverse temporal relationships and enhancing semantic discrimination. To this end, we construct a new plug-and-play module, MTDA, which uses three branches, Multi-Headed Attention (MHA), Temporal Shift (TS), and Dilated Aggregation (DA), to extract different temporal relationships. Specifically, the MHA branch models the video information globally and projects the features into different semantic spaces, enhancing the expressiveness and discrimination of the features. The DA branch extracts temporal information at different scales via dilated convolutions and captures the temporal features of local regions in the video. The TS branch fuses the features of adjacent frames at a local scale and enhances the information flow. MTDA-Net learns the temporal relationships between video segments in these different branches and builds powerful video representations from them. Experimental results on the XD-Violence dataset show that MTDA-Net significantly improves the detection accuracy of abnormal behaviors.
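
To make the three-branch design concrete, the following is a minimal PyTorch sketch of a module in the spirit of MTDA, operating on per-snippet video features. The feature dimension, number of attention heads, dilation rates, channel-shift fraction, and the final fusion layer are illustrative assumptions, not the paper's exact architecture.

```python
# A hedged sketch of a three-branch temporal module (MHA / TS / DA style).
# All hyperparameters below are assumptions chosen for illustration.
import torch
import torch.nn as nn

class MTDASketch(nn.Module):
    def __init__(self, dim=512, heads=4, dilations=(1, 2, 4)):
        super().__init__()
        # MHA branch: global temporal modeling via self-attention.
        self.mha = nn.MultiheadAttention(dim, heads, batch_first=True)
        # DA branch: dilated 1-D convolutions capture local context at several scales.
        self.da = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=3, padding=d, dilation=d) for d in dilations
        )
        # Fuse the three branch outputs back to the input dimension.
        self.fuse = nn.Linear(dim * 3, dim)

    @staticmethod
    def temporal_shift(x, fold=8):
        # TS branch: shift a fraction of channels forward/backward along time so
        # adjacent snippets exchange information (zero-padded at the boundaries).
        b, t, c = x.shape
        f = c // fold
        out = torch.zeros_like(x)
        out[:, 1:, :f] = x[:, :-1, :f]              # shift forward in time
        out[:, :-1, f:2 * f] = x[:, 1:, f:2 * f]    # shift backward in time
        out[:, :, 2 * f:] = x[:, :, 2 * f:]         # remaining channels unchanged
        return out

    def forward(self, x):                            # x: (batch, time, dim) snippet features
        g, _ = self.mha(x, x, x)                     # global branch
        s = self.temporal_shift(x)                   # local shift branch
        d = x.transpose(1, 2)                        # (batch, dim, time) for Conv1d
        d = sum(conv(d) for conv in self.da).transpose(1, 2)   # multi-scale local branch
        return self.fuse(torch.cat([g, s, d], dim=-1))          # aggregated representation
```

In this sketch the three branch outputs are simply concatenated and projected; the paper's aggregation of the different temporal relationships may differ, but the separation into a global attention path, a channel-shift path, and a multi-dilation path mirrors the description above.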
