Abstract

In the field of human action recognition (HAR), two-stream models have been widely employed. However, traditional two-stream network models disregard the inter-frame sequence characteristics of video, which reduces model robustness when local sequence information interacts with long-term motion information. In light of this, a novel three-stream neural network is proposed that combines the long-term and short-term characteristics of a frame sequence with spatio-temporal information. First, the optical-flow image frames and RGB image frames are extracted from the video to obtain its motion and spatial information; the optical-flow frames are fed into a temporal network, the RGB frames into a spatial network, and the spatial information additionally into a sequence-feature processing network, after which the three networks are pretrained. Once training concludes, features are extracted from each stream, combined through a weighted parallel-fusion algorithm, and the action categories are classified with a Multi-Layer Perceptron. Experimental results on the UCF11, UCF50, and HMDB51 datasets demonstrate that the model effectively integrates the spatio-temporal and frame-sequence information of human actions, yielding a significant improvement in recognition accuracy: 99.17%, 97.40%, and 96.88% on the three datasets, respectively, and notably enhancing the generalization capability and validity of conventional two-stream and three-stream models.
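The fusion stage described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the stream weights, feature dimensions, and MLP architecture below are assumed values chosen for demonstration, since the abstract does not specify them.

```python
import numpy as np

def fuse_features(spatial, temporal, sequence, weights=(0.4, 0.3, 0.3)):
    """Weighted parallel fusion: scale each stream's feature vector by its
    weight and concatenate. The weights here are illustrative assumptions."""
    w_s, w_t, w_q = weights
    return np.concatenate([w_s * spatial, w_t * temporal, w_q * sequence])

class MLPClassifier:
    """Minimal one-hidden-layer perceptron (forward pass only), standing in
    for the Multi-Layer Perceptron classifier mentioned in the abstract."""
    def __init__(self, in_dim, hidden_dim, n_classes, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.standard_normal((in_dim, hidden_dim)) * 0.01
        self.b1 = np.zeros(hidden_dim)
        self.W2 = rng.standard_normal((hidden_dim, n_classes)) * 0.01
        self.b2 = np.zeros(n_classes)

    def predict_proba(self, x):
        h = np.maximum(0.0, x @ self.W1 + self.b1)   # ReLU hidden layer
        logits = h @ self.W2 + self.b2
        e = np.exp(logits - logits.max())            # stable softmax
        return e / e.sum()

# Hypothetical usage with dummy 128-dimensional per-stream features
# and 11 classes (as in UCF11):
spatial, temporal, sequence = (np.ones(128) for _ in range(3))
fused = fuse_features(spatial, temporal, sequence)
clf = MLPClassifier(in_dim=fused.size, hidden_dim=64, n_classes=11)
probs = clf.predict_proba(fused)
```

In practice the three feature vectors would come from the pretrained spatial, temporal, and sequence networks; only the fusion-and-classify step is shown here.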
