Abstract

Convolutional neural networks, which have achieved outstanding performance in image recognition, have been extensively applied to action recognition. The mainstream approaches to video understanding can be categorized into two-dimensional and three-dimensional convolutional neural networks. Although three-dimensional convolutional filters can learn the temporal correlation between different frames by extracting the features of multiple frames simultaneously, it results in an explosive number of parameters and calculation cost. Methods based on two-dimensional convolutional neural networks use fewer parameters; they often incorporate optical flow to compensate for their inability to learn temporal relationships. However, calculating the corresponding optical flow results in additional calculation cost; further, it necessitates the use of another model to learn the features of optical flow. We proposed an action recognition framework based on the two-dimensional convolutional neural network; therefore, it was necessary to resolve the lack of temporal relationships. To expand the temporal receptive field, we proposed a multi-scale temporal shift module, which was then combined with a temporal feature difference extraction module to extract the difference between the features of different frames. Finally, the model was compressed to make it more compact. We evaluated our method on two major action recognition benchmarks: the HMDB51 and UCF-101 datasets. Before compression, the proposed method achieved an accuracy of 72.83% on the HMDB51 dataset and 96.25% on the UCF-101 dataset. Following compression, the accuracy was still impressive, at 95.57% and 72.19% on each dataset. The final model was more compact than most related works.

Highlights

  • In the field of computer vision, human action recognition has become increasingly research-worthy

  • Methods based on two-dimensional convolutional neural networks use fewer parameters; they often incorporate optical flow to compensate for their inability to learn temporal relationships

  • We proposed an action recognition framework based on the two-dimensional convolutional neural network; it was necessary to resolve the lack of temporal relationships

Read more

Summary

Introduction

In the field of computer vision, human action recognition has become increasingly research-worthy. With the development of technology, action recognition has wide applications in the present era. Several studies on action recognition led to the direct inflation of the filters of these models from two-dimensional (2D) to three-dimensional (3D) to obtain inflated 3D ConvNets (I3D) [7], resolution 3D LLC (Res3D) [8], ResNeXt3D [9], among other models. There are two main approaches to action recognition: 2D CNN (convolutional neural network) and 3D CNN. The 2D CNN method performs convolution on one frame at a time, without temporal fusion. The 3D CNN method performs convolution on multiple frames using 3D convolutional filters to achieve spatio-temporal learning

Methods
Results
Conclusion
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call