Abstract

Human Action Recognition (HAR) is a challenging problem in computer vision that has received a great deal of attention over the last decade. With the advent of deep learning techniques such as convolutional neural networks (CNNs), the recognition performance of HAR systems has improved significantly over traditional methods, largely owing to the powerful representation capabilities of CNNs. Most of the literature uses 2D CNNs or their 3D counterparts to learn spatial and temporal image-level features from videos. In this paper, we develop an end-to-end HAR framework based on a hybrid 2D/3D CNN. The hybrid feature extractor exploits the potential complementarity of the two: 2D convolutions learn spatial, per-frame features, while 3D convolutions learn short-range spatio-temporal features. The CNN features extracted from the video sequence are then fed into a Long Short-Term Memory (LSTM) network to capture short- and long-term temporal dependencies in the data. Inspired by human visual attention mechanisms, we also incorporate a visual attention module that focuses on semantically relevant, salient features in the visual representations. The model is trained and evaluated on the KTH dataset and achieves promising recognition performance compared with state-of-the-art methods.
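Since the abstract outlines the full pipeline (a hybrid 2D/3D CNN feature extractor, a visual attention module, and an LSTM classifier), a minimal sketch may help fix ideas. The sketch below assumes PyTorch; the class name HybridCNNLSTM, all layer sizes, and the simple per-frame soft-attention scoring are illustrative assumptions, not the authors' exact architecture. Only the six-class output reflects the KTH dataset, which contains six action categories.

    import torch
    import torch.nn as nn

    class HybridCNNLSTM(nn.Module):
        """Illustrative hybrid 2D/3D CNN + soft attention + LSTM classifier."""

        def __init__(self, num_classes=6, feat_dim=128, hidden_dim=256):
            super().__init__()
            # 2D branch: spatial features, applied independently to each frame.
            self.cnn2d = nn.Sequential(
                nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(32, feat_dim),
            )
            # 3D branch: short-range spatio-temporal features over the clip.
            self.cnn3d = nn.Sequential(
                nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool3d((None, 1, 1)),  # pool space, keep time axis
            )
            self.proj3d = nn.Linear(32, feat_dim)
            # Soft attention: one relevance score per frame over fused features.
            self.attn = nn.Linear(2 * feat_dim, 1)
            # LSTM models short- and long-term dependencies across frames.
            self.lstm = nn.LSTM(2 * feat_dim, hidden_dim, batch_first=True)
            self.classifier = nn.Linear(hidden_dim, num_classes)

        def forward(self, clip):
            # clip: (batch, channels, time, height, width)
            b, c, t, h, w = clip.shape
            # Per-frame 2D features.
            frames = clip.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
            f2d = self.cnn2d(frames).view(b, t, -1)            # (b, t, feat_dim)
            # Clip-level 3D features, one vector per time step.
            f3d = self.cnn3d(clip).squeeze(-1).squeeze(-1)     # (b, 32, t)
            f3d = self.proj3d(f3d.permute(0, 2, 1))            # (b, t, feat_dim)
            # Fuse both streams and re-weight frames by attention.
            fused = torch.cat([f2d, f3d], dim=-1)              # (b, t, 2*feat_dim)
            weights = torch.softmax(self.attn(fused), dim=1)   # (b, t, 1)
            # LSTM over the attended sequence; classify from the final state.
            out, _ = self.lstm(fused * weights)
            return self.classifier(out[:, -1])

With this sketch, a call such as HybridCNNLSTM()(torch.randn(2, 3, 16, 64, 64)) processes two 16-frame RGB clips and returns logits of shape (2, 6).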
