Abstract

In view of the difficulty of applying optical-flow-based human action recognition, owing to its large computational cost, a human action recognition algorithm, the I3D-shufflenet model, is proposed, combining the advantages of the I3D neural network and the lightweight shufflenet model. The 5 × 5 convolution kernel of I3D is replaced by two 3 × 3 convolution kernels, which reduces the amount of computation. A shuffle layer is adopted to achieve feature exchange across channels. Recognition and classification of human actions are performed with the trained I3D-shufflenet model. The experimental results show that the shuffle layer improves the composition of features in each channel, which promotes the utilization of useful information. Histogram of Oriented Gradients (HOG) spatial-temporal features of the object are extracted for training, which significantly improves the expressiveness of the human action representation and reduces the cost of feature extraction. I3D-shufflenet is evaluated on the UCF101 dataset and compared with other models. The final results show that I3D-shufflenet achieves higher accuracy than the original I3D, reaching 96.4%.
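The computational saving from replacing a single 5 × 5 kernel with two stacked 3 × 3 kernels can be checked with simple weight-count arithmetic. The sketch below is illustrative only: it assumes cubic (3-D) kernels as in I3D, ignores biases, and uses hypothetical channel counts not taken from the paper.

```python
def conv3d_params(kernel, in_ch, out_ch):
    """Weight count of a 3-D convolution with a cubic kernel (bias ignored)."""
    return kernel ** 3 * in_ch * out_ch

# Hypothetical layer with 64 input and 64 output channels.
single_5x5 = conv3d_params(5, 64, 64)      # 5^3 = 125 weights per filter pair
double_3x3 = 2 * conv3d_params(3, 64, 64)  # 2 * 3^3 = 54 weights per filter pair

print(single_5x5, double_3x3)  # 512000 221184
```

Per filter pair the stack needs 54 weights instead of 125, a reduction of roughly 57%, while still covering a 5 × 5 × 5 receptive field (two stacked 3-kernels compose to an effective extent of 5 in each dimension).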

Highlights

  • With the development of artificial intelligence, the progress of computer vision has received special attention

  • In the 1970s, a human body description model was proposed by Professor Johansson [1], which had a great impact on human body recognition

  • With the development of deep learning, it is widely used in the field of human action recognition, which greatly improves the accuracy of human action recognition



Introduction

With the development of artificial intelligence, progress in computer vision has received special attention. Ji et al. [8] proposed a three-dimensional convolutional neural network for spatial-temporal feature extraction. The features extracted by the proposed model are exchanged with those of other channels through a shuffle operation [17], so that more useful information is exploited to improve the performance of human action recognition. Traditional 2D convolution is well suited to spatial feature extraction but struggles with the continuous-frame processing of video data. A traditional deep learning network generally uses a single-size convolution kernel: the input data are processed by that kernel and a single feature set is generated. The I3D network inherits the Inception module of Googlenet, using convolution kernels of different sizes for feature extraction. The I3D neural network adds a convolution operation over adjacent temporal information, which enables action recognition across continuous frames.
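The shuffle operation referenced above can be sketched without any deep-learning framework: the channel axis is viewed as a (groups, channels_per_group) grid, transposed, and flattened, so that each group in the next layer receives channels from every group of the previous layer. This is a minimal pure-Python illustration of the ShuffleNet-style channel shuffle, operating on channel indices rather than real feature maps; the function name and group counts are hypothetical.

```python
def channel_shuffle(channels, groups):
    """ShuffleNet-style channel shuffle on a flat list of channel entries.

    Conceptually: reshape to (groups, channels_per_group),
    transpose, then flatten back to a single channel list.
    """
    n = len(channels)
    assert n % groups == 0, "channel count must be divisible by group count"
    per_group = n // groups
    # Read the (groups x per_group) grid column-wise to interleave the groups.
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]

# 6 channels in 2 groups: [0,1,2 | 3,4,5] -> interleaved [0,3,1,4,2,5]
print(channel_shuffle(list(range(6)), 2))  # [0, 3, 1, 4, 2, 5]
```

After the shuffle, a grouped convolution that only sees the first half of the channels now receives features originating from both groups, which is the cross-channel information flow the shuffle layer is meant to provide.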

I3D-Shufflenet
Channel Shuffle
I3D-Shufflenet Structure
Experiment
Hyperparameter Settings
Loss Function
Learning
Confusion Matrix
Feature Map Output
Class Activation Mapping
Comparisons
Findings
Conclusions
