Abstract

Surveillance video anomaly detection (SVAD) is a challenging task owing to variations in object scale, the difficulty of discriminating unexpected events, the influence of the background, and the wide range of definitions of anomalous events across surveillance contexts. In this work, we introduce an end-to-end hybrid convolutional neural network (CNN) and vision-transformer-based framework for anomaly detection. The proposed framework uses spatial and temporal information from a surveillance video to detect anomalous events and operates in two steps: in the first step, an efficient backbone CNN model extracts spatial features; in the second step, these features are passed to the transformer-based model to learn the long-term temporal relationships between complex surveillance events. The features from the backbone model are fed to a sequential learning model in which temporal self-attention is utilised to generate an attention map; this allows the proposed framework to learn spatiotemporal features effectively and to detect anomalous events. Our experimental results on several benchmark SVAD datasets demonstrate the validity of the proposed framework, which outperforms other state-of-the-art approaches, achieving high AUC values of 94.6%, 98.4%, and 89.6% on the ShanghaiTech, UCSD Ped2, and CUHK Avenue datasets, respectively.
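
As a rough, non-authoritative illustration of the two-step design summarised above, the PyTorch sketch below pairs a per-frame CNN backbone with a transformer encoder that applies temporal self-attention over the resulting feature sequence. All concrete choices here are assumptions for illustration only: the abstract does not name the backbone (MobileNetV2 is used as a stand-in "efficient" CNN), nor the feature dimensions, the sigmoid scoring head, or the class name HybridCNNTransformer.

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2


class HybridCNNTransformer(nn.Module):
    """Sketch of the two-step pipeline: a CNN backbone extracts
    per-frame spatial features, then a transformer encoder models
    long-term temporal relations across the frame sequence."""

    def __init__(self, feat_dim=1280, d_model=512, n_heads=8, n_layers=4):
        super().__init__()
        # Step 1: efficient backbone CNN for spatial features.
        # MobileNetV2 is an assumption; weights=None keeps the
        # example self-contained (no download of pretrained weights).
        self.backbone = mobilenet_v2(weights=None).features
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.proj = nn.Linear(feat_dim, d_model)
        # Step 2: transformer encoder applies temporal self-attention
        # over the sequence of per-frame feature vectors.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, n_layers)
        # Hypothetical scoring head: one anomaly score in [0, 1] per frame.
        self.head = nn.Sequential(nn.Linear(d_model, 1), nn.Sigmoid())

    def forward(self, clip):                          # clip: (B, T, 3, H, W)
        B, T, C, H, W = clip.shape
        x = clip.reshape(B * T, C, H, W)              # fold time into batch
        x = self.pool(self.backbone(x)).flatten(1)    # (B*T, feat_dim)
        x = self.proj(x).reshape(B, T, -1)            # (B, T, d_model)
        x = self.temporal(x)                          # temporal self-attention
        return self.head(x).squeeze(-1)               # (B, T) anomaly scores


# Usage: score two 16-frame clips of 224x224 RGB frames.
model = HybridCNNTransformer().eval()
with torch.no_grad():
    scores = model(torch.randn(2, 16, 3, 224, 224))
print(scores.shape)  # torch.Size([2, 16])
```

Folding the temporal dimension into the batch for the backbone pass, then unfolding it for the transformer, is a common way to combine a 2D CNN with sequence attention; whether the authors do exactly this is not stated in the abstract.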
