Abstract

Video anomaly detection identifies abnormal content that does not appear in the training set; because the training set offers only normal content, unsupervised approaches such as generative methods are applied. Recently, skeleton-based features have been leveraged to alleviate background distractions while highlighting human motion. In contrast, we regard appearance and skeleton information as complementary, and propose a generative method consisting of an appearance branch implemented by a 3D U-Net and a skeleton branch implemented by a novel Skeleton-Transformer. Moreover, a multi-head co-attention-based fusion module is proposed to fuse the intermediate features extracted from the appearance and skeleton branches, and then transfer the fused information back to each branch. This fusion module addresses the challenge of maintaining the feature structure of each branch during fusion, which is essential in a generative model. Experimental results show that the fusion module improves the performance of both the appearance and skeleton branches, and their combination achieves state-of-the-art performance on the HR-ShanghaiTech and Corridor datasets.
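The fusion idea described above can be sketched as cross-branch (co-)attention with a residual connection, so that each branch's feature structure is preserved. The following is a minimal, hypothetical numpy sketch, not the paper's implementation: the projection weights are random stand-ins for learned parameters, and token shapes are assumed to be `(tokens, dim)` per branch.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_co_attention(feat_a, feat_s, num_heads=4, rng=None):
    """Hypothetical sketch of multi-head co-attention fusion.

    Each branch queries the other branch's features; the attended
    context is added back residually, so the output keeps the shape
    (and, loosely, the structure) of the querying branch.

    feat_a: (N, D) appearance tokens; feat_s: (M, D) skeleton tokens.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    D = feat_a.shape[1]
    assert D % num_heads == 0
    dh = D // num_heads
    # Random projections stand in for learned Q/K/V weights (assumption).
    Wq, Wk, Wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))

    def attend(q_feat, kv_feat):
        # Split the projected features into heads: (heads, tokens, dh).
        Q = (q_feat @ Wq).reshape(-1, num_heads, dh).transpose(1, 0, 2)
        K = (kv_feat @ Wk).reshape(-1, num_heads, dh).transpose(1, 0, 2)
        V = (kv_feat @ Wv).reshape(-1, num_heads, dh).transpose(1, 0, 2)
        attn = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(dh), axis=-1)
        # Merge heads back: (tokens, D).
        return (attn @ V).transpose(1, 0, 2).reshape(-1, D)

    # Residual fusion: each branch receives context from the other branch.
    fused_a = feat_a + attend(feat_a, feat_s)
    fused_s = feat_s + attend(feat_s, feat_a)
    return fused_a, fused_s
```

The residual add is what lets the fused output be handed back to each generative branch without destroying its original feature layout.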
