Abstract

Tailing is defined as an event where a suspicious person follows someone closely. We define the problem of tailing detection from videos as an anomaly detection problem, where the goal is to find abnormalities in the walking pattern of the pedestrians (victim and follower). We therefore propose a modified Time-Series Vision Transformer (TSViT), a method for anomaly detection in video, specifically for tailing detection with a small dataset. We introduce an effective way to train TSViT with a small dataset by regularizing the prediction model. To do so, we first encode the spatial information of the pedestrians into 2D patterns and then pass them as tokens to the TSViT. Through a series of experiments, we show that tailing detection on a small dataset using TSViT outperforms popular CNN-based architectures, as the CNN architectures tend to overfit on a small dataset of time-series images. We also show that, on time-series images, the performance of CNN-based architectures gradually drops as the network depth is increased to add capacity. In contrast, decreasing the number of heads in the Vision Transformer architecture yields good performance on time-series images, and the performance improves further as the input resolution of the images is increased. Experimental results demonstrate that TSViT performs better than both the handcrafted rule-based method and the CNN-based method for tailing detection. TSViT can be used in many video anomaly detection applications, even with a small dataset.
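
The encoding step mentioned in the abstract (spatial information turned into 2D patterns, then passed as tokens) can be illustrated with a minimal sketch. The abstract does not specify the encoding, so everything here is an assumption made for illustration: the function names `rasterize_tracks` and `to_patch_tokens`, the grid size, the cell labels, and the use of non-overlapping square patches in the style of a standard Vision Transformer. It is not the authors' implementation.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # assumed: (x, y) normalized to [0, 1)


def rasterize_tracks(victim: List[Point], follower: List[Point],
                     size: int = 8) -> List[List[int]]:
    """Draw two pedestrian trajectories onto a size x size grid.

    Hypothetical encoding: 0 = empty, 1 = victim, 2 = follower, 3 = both.
    """
    grid = [[0] * size for _ in range(size)]
    for label, track in ((1, victim), (2, follower)):
        for x, y in track:
            col = min(int(x * size), size - 1)
            row = min(int(y * size), size - 1)
            grid[row][col] |= label  # bitwise OR so overlapping cells become 3
    return grid


def to_patch_tokens(grid: List[List[int]], patch: int = 4) -> List[List[int]]:
    """Split a square grid into non-overlapping patch x patch blocks,
    each flattened row by row into one token (ViT-style patching)."""
    size = len(grid)
    if size % patch != 0:
        raise ValueError("grid size must be divisible by patch size")
    tokens = []
    for r0 in range(0, size, patch):
        for c0 in range(0, size, patch):
            tokens.append([grid[r][c]
                           for r in range(r0, r0 + patch)
                           for c in range(c0, c0 + patch)])
    return tokens


# Toy example: the follower retraces the victim's earlier positions.
victim = [(0.1, 0.1), (0.3, 0.3), (0.5, 0.5)]
follower = [(0.0, 0.0), (0.1, 0.1), (0.3, 0.3)]
grid = rasterize_tracks(victim, follower)
tokens = to_patch_tokens(grid)
print(len(tokens), len(tokens[0]))
```

In this toy case, cells visited by both pedestrians carry the combined label, so a follower retracing the victim's path leaves a visible overlap in the 2D pattern. A real model would additionally embed each token and add positional information before the transformer layers.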

Highlights

  • Tailing is a situation in which one pedestrian follows another pedestrian in the same direction for some amount of time

  • The Transformer-based “Base” model, Time-Series Vision Transformer (TSViT)-B/512, achieves the best accuracy of 76.56%, compared with the 63.54% best accuracy of a simple Convolutional Neural Network (CNN), an improvement of 13.02 percentage points

  • A method of tailing detection based on Vision Transformer is proposed, which is an end-to-end trainable framework

Introduction

Tailing is a situation in which one pedestrian follows another pedestrian in the same direction for some amount of time. The intention of a tailing person can range from assault and snatching to kidnapping. According to a survey [2], the surveillance cameras installed worldwide in 2016 will produce approximately 566 GB of data in one day. This rapid growth of surveillance video data presents greater challenges for video processing and understanding. Computer vision techniques for event detection [3], video retrieval [4], and video summarization [5] are a prominent part of modern surveillance systems.
