Surveillance video anomaly detection (SVAD) is a challenging task owing to variations in object scale, the difficulty of discriminating unexpected events from normal activity, the influence of the background, and the wide range of definitions of anomalous events across surveillance contexts. In this work, we introduce an end-to-end hybrid framework for anomaly detection that combines a convolutional neural network (CNN) with a vision transformer. The proposed framework uses the spatial and temporal information in a surveillance video to detect anomalous events and operates in two steps: first, an efficient backbone CNN extracts spatial features; second, these features are passed to a transformer-based model that learns long-term temporal relationships between complex surveillance events. Specifically, the backbone features are fed to a sequential learning model in which temporal self-attention generates an attention map, allowing the framework to learn spatiotemporal features effectively and to detect anomalous events. Experimental results on several benchmark video anomaly detection datasets confirm the validity of the proposed framework, which outperforms other state-of-the-art approaches, achieving AUC values of 94.6%, 98.4%, and 89.6% on the ShanghaiTech, UCSD Ped2, and CUHK Avenue datasets, respectively.
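To make the two-step pipeline concrete, the following is a minimal PyTorch sketch of a hybrid CNN-plus-transformer detector of the kind the abstract describes. The choice of MobileNetV3-Small as the "efficient backbone", all layer sizes (d_model, nhead, num_layers, max_len), the learned positional embedding, and the per-frame sigmoid anomaly head are illustrative assumptions, not the paper's actual architecture.

# Hedged sketch: per-frame CNN spatial features, then a transformer encoder
# whose self-attention models temporal relationships across the clip.
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v3_small


class HybridAnomalyDetector(nn.Module):
    def __init__(self, d_model=256, nhead=4, num_layers=2, max_len=64):
        super().__init__()
        # Step 1: efficient CNN backbone for per-frame spatial features
        # (MobileNetV3-Small is an assumed stand-in for the paper's backbone).
        backbone = mobilenet_v3_small(weights=None)
        self.cnn = backbone.features            # convolutional feature extractor
        self.pool = nn.AdaptiveAvgPool2d(1)     # global spatial pooling
        self.proj = nn.Linear(576, d_model)     # 576 = MobileNetV3-Small channels

        # Learned positional embedding so the transformer sees frame order.
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))

        # Step 2: transformer encoder; temporal self-attention relates frames.
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, batch_first=True
        )
        self.temporal = nn.TransformerEncoder(layer, num_layers=num_layers)

        # Assumed head: one anomaly score in [0, 1] per frame.
        self.head = nn.Sequential(nn.Linear(d_model, 1), nn.Sigmoid())

    def forward(self, clip):                    # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        x = clip.flatten(0, 1)                  # (B*T, 3, H, W)
        x = self.pool(self.cnn(x)).flatten(1)   # (B*T, 576) spatial features
        x = self.proj(x).view(b, t, -1)         # (B, T, d_model)
        x = self.temporal(x + self.pos[:, :t])  # temporal self-attention
        return self.head(x).squeeze(-1)         # (B, T) anomaly scores


if __name__ == "__main__":
    model = HybridAnomalyDetector()
    scores = model(torch.randn(2, 16, 3, 224, 224))  # 2 clips of 16 frames
    print(scores.shape)                              # torch.Size([2, 16])

Decoupling the backbone from the temporal model in this way lets the spatial extractor run once per frame while the transformer reasons over the whole clip, which is the main design point the abstract emphasises.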