Abstract

With the rapid growth of digital technology, large amounts of video data are being generated, making video analytics a promising technology. Human activity recognition in videos is currently receiving increased attention, and activity recognition systems form a large field of research and development focused on advanced machine learning algorithms, innovations in hardware architecture, and decreasing the cost of monitoring while increasing safety (Guo and Lai in Pattern Recognit 47:3343–3361, 2014, [1]). Existing systems for action recognition use Convolutional Neural Networks (CNNs): a video is treated as a sequence of frames, and the frame-level CNN features are fed to a Long Short-Term Memory (LSTM) model for video recognition. However, this methodology takes frame-level CNN features as input to the LSTM and may therefore fail to capture the rich motion information in adjacent frames or across multiple clips. It is important to consider adjacent frames, which yield salient features, instead of mapping an entire frame into a static representation. To mitigate this drawback, a new methodology is proposed: first, saliency-aware methods are applied to generate saliency-aware videos; then, an end-to-end pipeline is designed by integrating a 3D CNN with an LSTM, followed by a time-series pooling layer and a softmax layer to predict the activities in the video.
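The proposed pipeline (saliency masking, 3D CNN features, LSTM, temporal pooling, softmax) can be sketched on toy data. The sketch below is a minimal NumPy illustration of the data flow only; the random saliency map, single 3x3x3 filter, and layer sizes are hypothetical stand-ins, not the authors' implementation or trained components:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv3d(clip, kernel):
    """Valid 3-D convolution of a (T, H, W) clip with a (kt, kh, kw) kernel."""
    T, H, W = clip.shape
    kt, kh, kw = kernel.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[t, i, j] = np.sum(clip[t:t+kt, i:i+kh, j:j+kw] * kernel)
    return out

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step; gate pre-activations stacked as [i, f, o, g]."""
    z = W @ x + U @ h + b
    d = h.size
    i, f, o = (1.0 / (1.0 + np.exp(-z[k*d:(k+1)*d])) for k in range(3))
    g = np.tanh(z[3*d:])
    c = f * c + i * g          # update cell state
    h = o * np.tanh(c)         # new hidden state
    return h, c

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy clip: 8 frames of 16x16; the saliency map here is a random stand-in
# for a real saliency-detection method.
video = rng.random((8, 16, 16))
saliency = rng.random((16, 16))
clip = video * saliency                               # saliency-aware video

# 3D CNN stage: one 3x3x3 filter, then flatten each time slice to a feature vector.
feat = conv3d(clip, rng.standard_normal((3, 3, 3)))   # shape (6, 14, 14)
seq = feat.reshape(feat.shape[0], -1)                 # 6 time steps, 196-dim features

# LSTM over the temporal sequence of 3D-CNN features.
d, n_classes = 32, 10
W = rng.standard_normal((4*d, seq.shape[1])) * 0.01
U = rng.standard_normal((4*d, d)) * 0.01
b = np.zeros(4*d)
h = c = np.zeros(d)
hidden_states = []
for x in seq:
    h, c = lstm_step(x, h, c, W, U, b)
    hidden_states.append(h)

# Time-series pooling (mean over time), then a softmax classification layer.
pooled = np.mean(hidden_states, axis=0)
logits = rng.standard_normal((n_classes, d)) @ pooled
probs = softmax(logits)                               # per-class probabilities
```

The mean pooling collapses the variable-length hidden-state sequence into a single vector, so the same classifier head works regardless of clip length.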
