Abstract

Human action recognition is an important task in video content analysis and computer vision. Because the performance of most existing action recognition frameworks depends on feature representation, much research aims to construct more discriminative features. In this paper, we propose a manifold learning framework based on optical flow for action recognition. First, we compute the dense optical flow field of the original video sequence and adopt an attention pooling (AP) layer to separate the target area from the background area, eliminating background interference. On this basis, we develop motion features (MF) based on the physical characteristics of dense optical flow to characterize human motion information. Manifold learning is then introduced to compute motion variance features (MVF), which reflect the rate of change of the motion features and measure the spatial correlation between features in non-Euclidean space. Finally, the MVF obtained by manifold learning are fused with the MF, and the fused features are fed into two fully connected (FC) layers in series for action classification and recognition. Experiments on several classic datasets show that the proposed method improves performance by 0.98%, 1.86%, and 0.99% on UCF101, HMDB51, and JHMDB, respectively.
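To make the idea of motion features derived from the physical characteristics of a dense flow field more concrete, the following is a minimal sketch in plain Python. The abstract does not list the exact quantities used, so the four computed here (magnitude, orientation, divergence, and curl) are illustrative assumptions, and the function name `motion_features` is hypothetical. The flow components `u` and `v` are assumed to be already available as 2-D lists.

```python
import math


def motion_features(u, v):
    """Per-pixel (magnitude, orientation, divergence, curl) of a flow field.

    u, v: 2-D lists (rows x cols) holding the horizontal and vertical flow.
    The choice of these four quantities is an assumption for illustration,
    not the feature set defined in the paper.
    """
    rows, cols = len(u), len(u[0])
    if rows < 2 or cols < 2:
        raise ValueError("flow field must be at least 2 x 2")

    def ddx(f, r, c):
        # Central difference along x; one-sided at the left/right borders.
        c0, c1 = max(c - 1, 0), min(c + 1, cols - 1)
        return (f[r][c1] - f[r][c0]) / (c1 - c0)

    def ddy(f, r, c):
        # Central difference along y; one-sided at the top/bottom borders.
        r0, r1 = max(r - 1, 0), min(r + 1, rows - 1)
        return (f[r1][c] - f[r0][c]) / (r1 - r0)

    features = []
    for r in range(rows):
        row = []
        for c in range(cols):
            magnitude = math.hypot(u[r][c], v[r][c])
            orientation = math.atan2(v[r][c], u[r][c])
            divergence = ddx(u, r, c) + ddy(v, r, c)   # local expansion
            curl = ddx(v, r, c) - ddy(u, r, c)         # local rotation
            row.append((magnitude, orientation, divergence, curl))
        features.append(row)
    return features


# Toy expanding field: u grows with the column index, v with the row index.
u = [[float(c) for c in range(3)] for _ in range(3)]
v = [[float(r) for _ in range(3)] for r in range(3)]
feats = motion_features(u, v)
```

In a real pipeline the flow would come from a dense optical flow estimator and the features would be computed only inside the region kept by the attention pooling step; both are outside the scope of this sketch.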

Highlights

  • The purpose of human action recognition (HAR) is to understand human behavior by analysing and processing videos that contain it

  • Although research on HAR has made significant progress in image segmentation [1]–[4], target detection [5]–[8], and related areas, it still faces a great challenge from the diversity and high non-linearity of human behavior, which are caused by the non-rigid structure of the human body, the confusion of background and motion features, and other factors

  • Mainstream action recognition frameworks are limited mainly in three respects: (1) deep learning frameworks often need to be trained with a large number of parameters and easily suffer from the curse of dimensionality; (2) manual features are one-sided, so their discriminative ability is not enough to characterize motion states; (3) intense interference from complex backgrounds confuses the recognition model


Summary

Introduction

The purpose of human action recognition (HAR) is to understand human behavior by analysing and processing videos that contain it. Mainstream action recognition frameworks are limited mainly in three respects: (1) deep learning frameworks often need to be trained with a large number of parameters and easily suffer from the curse of dimensionality; (2) manual features are one-sided, so their discriminative ability is not enough to characterize motion states; (3) intense interference from complex backgrounds confuses the recognition model. An attention pooling (AP) layer is inserted into the traditional 3-layer CNN structure to capture the region of interest (ROI) in consecutive video frames. This step effectively reduces both the interference caused by the background and the computational burden. The MF are then concatenated with the motion variation features and fed into two fully connected layers to complete the action recognition task.
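The final stage described above, concatenating the two feature sets and passing them through two fully connected layers, can be sketched as follows. This is a toy illustration under stated assumptions rather than the authors' implementation: the ReLU activation, the softmax output, the layer sizes, the random weights, and all function names (`dense`, `init_layer`, `classify`) are hypothetical, and the feature vectors are made-up numbers.

```python
import math
import random


def dense(x, weights, biases):
    """Affine layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (stand-in for trained values)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]


def classify(mf, mvf, fc1, fc2):
    """Fuse MF and MVF by concatenation, then apply two FC layers in series."""
    fused = list(mf) + list(mvf)
    hidden = [max(0.0, h) for h in dense(fused, *fc1)]  # FC 1 + ReLU (assumed)
    return softmax(dense(hidden, *fc2))                 # FC 2 + softmax (assumed)


rng = random.Random(0)
mf = [0.2, 0.5, 0.1]        # toy motion features
mvf = [0.7, 0.3]            # toy motion variation features
fc1 = init_layer(len(mf) + len(mvf), 4, rng)
fc2 = init_layer(4, 3, rng)  # 3 hypothetical action classes
probs = classify(mf, mvf, fc1, fc2)
```

Here `probs` holds one probability per hypothetical action class; with trained weights, the predicted action would be the index of the largest entry.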

