Abstract
Human action recognition is an important task in video content analysis and computer vision. Because the performance of most existing action recognition frameworks depends on the feature representation, much research aims to construct more discriminative features. In this paper, we propose a manifold learning framework based on optical flow for action recognition. First, we calculate the dense optical flow field of the original video sequence and adopt an attention pooling (AP) layer to separate the target area from the background area, eliminating background interference. On this basis, motion features (MF) based on the physical characteristics of dense optical flow are developed to characterize human motion information. Manifold learning is then introduced to calculate the motion variance features (MVF), which reflect the rate of change of the motion features and measure the spatial correlation between features in non-Euclidean space. Finally, the MVF obtained by manifold learning are fused with the MF, and the fused features are fed into two fully connected (FC) layers in series for action classification and recognition. Experiments on several classic datasets show that the proposed method achieves performance improvements of 0.98%, 1.86% and 0.99% on UCF101, HMDB51 and JHMDB, respectively.
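The abstract does not list which physical characteristics of the flow field make up the MF. As a minimal sketch, assuming they include per-pixel magnitude, orientation, divergence and vorticity (a common choice, not confirmed by the source), such quantities can be derived from a dense flow field with NumPy. The function name `flow_motion_features` and the synthetic rotation field are hypothetical and serve only as illustration.

```python
import numpy as np

def flow_motion_features(flow):
    """Per-pixel quantities derived from a dense flow field.

    flow: array of shape (H, W, 2); flow[..., 0] is the horizontal
    component u and flow[..., 1] is the vertical component v.
    The choice of these four quantities is an assumption, not the
    paper's definition of MF.
    """
    u, v = flow[..., 0], flow[..., 1]
    magnitude = np.hypot(u, v)        # speed at each pixel
    orientation = np.arctan2(v, u)    # direction of motion in radians
    # np.gradient returns derivatives along axis 0 (rows, y) first,
    # then along axis 1 (columns, x).
    du_dy, du_dx = np.gradient(u)
    dv_dy, dv_dx = np.gradient(v)
    divergence = du_dx + dv_dy        # local expansion or contraction
    vorticity = dv_dx - du_dy         # local rotation
    return np.stack([magnitude, orientation, divergence, vorticity], axis=-1)

# Synthetic rigid rotation about the image centre: u = -(y - cy), v = x - cx.
ys, xs = np.mgrid[0:8, 0:8].astype(float)
flow = np.stack([-(ys - 3.5), xs - 3.5], axis=-1)
features = flow_motion_features(flow)
```

For this synthetic field the divergence channel is zero everywhere and the vorticity channel is constant, which is a quick sanity check before applying the function to flow estimated from real video.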
Highlights
The purpose of human action recognition (HAR) is to understand human behavior by analysing and processing videos that contain it.
Although research related to HAR has made significant progress in image segmentation [1]–[4], target detection [5]–[8], and other areas, HAR still faces a great challenge because of the diversity and high non-linearity of human behavior, which stem from the non-rigid structure of the human body and the confusion between background and motion features.
Mainstream action recognition frameworks are limited mainly in three respects: (1) deep learning frameworks often require training a large number of parameters, which makes them prone to the curse of dimensionality; (2) hand-crafted features are one-sided, so their discriminative ability is insufficient to characterize motion states; (3) the intense interference caused by complex backgrounds confuses the recognition model.
Summary
The purpose of human action recognition (HAR) is to understand human behavior by analysing and processing videos that contain it. Mainstream action recognition frameworks are limited mainly in three respects: (1) deep learning frameworks often require training a large number of parameters, which makes them prone to the curse of dimensionality; (2) hand-crafted features are one-sided, so their discriminative ability is insufficient to characterize motion states; (3) the intense interference caused by complex backgrounds confuses the recognition model. An attention pooling (AP) layer is inserted into the traditional 3-layer CNN structure to capture the region of interest (ROI) in consecutive video frames. This step effectively reduces both the interference caused by the background and the computational burden. The MF are concatenated with the motion variation features and fed into two fully connected layers to complete the action recognition task.
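The final fusion and classification step can be sketched as follows. This is a minimal NumPy forward pass under stated assumptions, not the authors' implementation: the feature dimensions, the ReLU activation between the two layers, the softmax output, and the randomly initialised weights are all hypothetical, since the source specifies only concatenation followed by two fully connected layers.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(z):
    # Subtract the row maximum for numerical stability.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def classify(mf, mvf, params):
    """Concatenate MF with the variation features and apply two FC layers."""
    fused = np.concatenate([mf, mvf], axis=-1)           # feature fusion
    hidden = relu(fused @ params["W1"] + params["b1"])   # first FC layer (ReLU assumed)
    logits = hidden @ params["W2"] + params["b2"]        # second FC layer
    return softmax(logits)                               # class probabilities (assumed)

# Hypothetical sizes; 101 classes mirrors UCF101 purely for illustration.
rng = np.random.default_rng(0)
batch, mf_dim, mvf_dim, hidden_dim, num_classes = 4, 32, 16, 64, 101
params = {
    "W1": rng.normal(scale=0.1, size=(mf_dim + mvf_dim, hidden_dim)),
    "b1": np.zeros(hidden_dim),
    "W2": rng.normal(scale=0.1, size=(hidden_dim, num_classes)),
    "b2": np.zeros(num_classes),
}
mf = rng.normal(size=(batch, mf_dim))     # stand-in motion features
mvf = rng.normal(size=(batch, mvf_dim))   # stand-in variation features
probs = classify(mf, mvf, params)
predicted = probs.argmax(axis=1)          # predicted class index per sample
```

Concatenation keeps the two feature types in separate coordinates, so the first fully connected layer is free to learn how strongly each one should contribute to the decision.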