Abstract

Artificial intelligence models are widely used in human activity recognition, of which human action recognition is an important aspect. The core of human action recognition is understanding the temporal relationships between video frames. Almost all state-of-the-art methods for human action recognition in videos rely on optical flow. However, traditional local optical flow estimation methods are computationally expensive and cannot be trained end-to-end. In this paper, we propose a fast network for human action recognition. Our goal is to improve the efficiency of optical flow feature extraction and to explore how spatio-temporal features can be fused; our method combines spatial features and temporal features into a single fused representation. In addition, we replace the VGG16 network with a CNN with OFF, which processes optical flow features to obtain richer representations. Our model requires only RGB inputs and achieves state-of-the-art accuracies of 91.5% on UCF-101, 67.9% on HMDB51, 83.3% on MSR Daily Activity3D, and 91.25% on Florence 3D Actions. Compared with most state-of-the-art video action recognition models, our proposed model effectively improves the accuracy of human action recognition.
