Abstract

Human actions can now be recorded in public places such as airports, shopping malls, and educational institutes to monitor suspicious activities such as terrorism, fighting, theft, and vandalism. Surveillance videos contain adequate visual and motion information for events that occur within a camera’s view. Our study builds on the idea that an action is a sequence of moving body parts. In this paper, we propose a new descriptor that formulates human poses, tracks the relative motion of body parts across video frames, and extracts the position and orientation of those parts. We use Part Affinity Fields (PAFs) to associate detected body parts with the people present in the frame. The architecture jointly learns the body parts and their associations with other body parts in a sequential process, such that a pose can be formulated step by step. The complete pose is represented by a limited number of points, and following these points through the video lets us assign a defined action. These feature points are then classified with a Support Vector Machine (SVM). The proposed scheme was evaluated on four benchmark datasets containing criminal/suspicious actions such as kicking, punching, pushing, gun shooting, and sword fighting; it achieved an accuracy of 96.4% on UT-Interaction, 99% on UCF11, 98% on CASIA, and 88.72% on HCA.
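To make the role of PAFs concrete, the sketch below scores a candidate connection between two detected joints by sampling a vector field along the segment joining them and averaging its alignment with the segment direction. This is the general PAF association idea, not this paper's implementation: the field `toy_paf`, the joint coordinates, and the function name `paf_association_score` are all hypothetical, and a real system would read the field from a network's output.

```python
import math


def paf_association_score(paf, joint_a, joint_b, num_samples=10):
    """Approximate the line integral of a PAF along joint_a -> joint_b.

    paf is a callable (x, y) -> (vx, vy); here it is a hypothetical
    stand-in for a predicted part affinity field.
    """
    if num_samples < 2:
        raise ValueError("num_samples must be at least 2")
    dx = joint_b[0] - joint_a[0]
    dy = joint_b[1] - joint_a[1]
    length = math.hypot(dx, dy)
    if length == 0.0:
        return 0.0  # coincident joints: no direction to compare against
    ux, uy = dx / length, dy / length  # unit vector along the candidate limb
    total = 0.0
    for i in range(num_samples):
        t = i / (num_samples - 1)  # evenly spaced samples, endpoints included
        vx, vy = paf(joint_a[0] + t * dx, joint_a[1] + t * dy)
        total += vx * ux + vy * uy  # alignment of field with limb direction
    return total / num_samples


def toy_paf(x, y):
    # Synthetic field (assumption): unit vectors pointing along +x inside
    # a horizontal band, zero everywhere else.
    return (1.0, 0.0) if 0.0 <= y <= 2.0 else (0.0, 0.0)


elbow = (0.0, 1.0)
wrist_same_person = (10.0, 1.0)   # lies along the band
wrist_other_person = (0.0, 11.0)  # perpendicular to the band

score_match = paf_association_score(toy_paf, elbow, wrist_same_person)
score_mismatch = paf_association_score(toy_paf, elbow, wrist_other_person)
```

A candidate whose segment follows the field scores high, while one that cuts across it scores near zero, so the scores can be used to decide which detected parts belong to the same person.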

Highlights

  • Government and security institutions install surveillance cameras in homes, markets, hospitals, shopping malls, and public places to capture real-time events to ensure the safety of people

  • Datasets. The algorithm was assessed on four action datasets: the UT-Interaction dataset [40], the YouTube action dataset [41], the CASIA dataset [42], and Hybrid Criminal Action (HCA) [43]

  • The runtime performance of our approach depends on the number of people present in the video. The number of interest points increases with the number of people, which is the main factor in the time complexity


Summary

Introduction

Government and security institutions install surveillance cameras in homes, markets, hospitals, shopping malls, and public places to capture real-time events to ensure the safety of people. Pose estimation and tracking of body parts are useful methods of identification. These descriptors use different methods for feature extraction from the various regions of a video. Fisher Vectors [11] and Vector of Locally Aggregated Descriptors (VLAD) [12] are commonly used; such methodologies provide good performance for many solutions [8, 10, 13]. However, these encoding schemes lack spatiotemporal data, which is vital when dealing with videos. Our approach utilizes the preprocessing steps employed in [28], as the idea is to extract the motion information of the human body parts. The performance of our approach depends on how accurately the human poses are estimated and linked with the associated body parts.
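As an illustration of what extracting motion information from body parts could look like, the sketch below turns per-frame 2D keypoints into a position (limb midpoint) and orientation (limb angle) per limb, then records how both change between consecutive frames. It is a minimal sketch under stated assumptions, not the descriptor defined in the paper: the joint names, the `LIMBS` list, and the zero-filling of missing limbs are all hypothetical choices.

```python
import math

# Hypothetical limb definitions; joint names are illustrative only.
LIMBS = [("shoulder", "elbow"), ("elbow", "wrist")]


def limb_features(keypoints):
    """Map each fully detected limb to (mid_x, mid_y, angle) for one frame."""
    feats = {}
    for a, b in LIMBS:
        if a not in keypoints or b not in keypoints:
            continue  # skip limbs with an undetected joint
        (ax, ay), (bx, by) = keypoints[a], keypoints[b]
        angle = math.atan2(by - ay, bx - ax)  # orientation in radians
        feats[(a, b)] = ((ax + bx) / 2.0, (ay + by) / 2.0, angle)
    return feats


def wrap_angle(theta):
    """Wrap an angle difference into [-pi, pi)."""
    return (theta + math.pi) % (2.0 * math.pi) - math.pi


def motion_descriptor(frames):
    """Concatenate per-limb changes in position and orientation over time."""
    descriptor = []
    for prev, curr in zip(frames, frames[1:]):
        f_prev, f_curr = limb_features(prev), limb_features(curr)
        for limb in LIMBS:
            if limb in f_prev and limb in f_curr:
                px, py, pa = f_prev[limb]
                cx, cy, ca = f_curr[limb]
                descriptor.extend([cx - px, cy - py, wrap_angle(ca - pa)])
            else:
                # Assumption: zero-fill so the vector keeps a fixed length.
                descriptor.extend([0.0, 0.0, 0.0])
    return descriptor


# Two toy frames: the forearm swings from horizontal to vertical.
frames = [
    {"shoulder": (0.0, 0.0), "elbow": (2.0, 0.0), "wrist": (4.0, 0.0)},
    {"shoulder": (0.0, 0.0), "elbow": (2.0, 0.0), "wrist": (2.0, 2.0)},
]
descriptor = motion_descriptor(frames)
```

Because the vector has a fixed length for a given number of frames and limbs, it could be passed to a standard classifier such as an SVM, which is the classification step the abstract describes.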
