Vision-based Human Activity Recognition (HAR) aims to recognize human activities from video data and has extensive applications in modern industry and daily life. Inflated 3D (I3D) is a deep learning architecture commonly used for action recognition; it operates on two video streams: an RGB stream and an optical flow stream. I3D has achieved great success on a variety of action recognition benchmarks. However, computing optical flow incurs a high computational cost, making the approach unsuitable for real-time applications. We propose a simple alternative motion-information extractor that replaces the optical flow branch and reduces the computational cost: a modified I3D that takes 128 frames of 112x112 images as input. The low spatial resolution and long temporal range of the proposed I3D RGB stream suppress spatial detail while enhancing motion information. Experiments show that this simple motion stream improves the accuracy of the original I3D spatial stream by 4.09% on the Kinetics-400 dataset.
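To make the proposed input format concrete, the sketch below shows one plausible way to turn a raw video into the motion-stream clip the abstract describes (128 frames at 112x112, versus the 64 frames at 224x224 commonly used with standard I3D). The function name, uniform frame sampling, and nearest-neighbour striding are illustrative assumptions, not the paper's exact preprocessing pipeline.

```python
import numpy as np

def to_motion_clip(video, num_frames=128, size=112):
    """Hypothetical preprocessing sketch for the low-resolution,
    long-temporal-range RGB motion stream described in the abstract.

    video: (T, H, W, 3) uint8 array -> (num_frames, size, size, 3)
    """
    t, h, w, _ = video.shape
    # Sample num_frames indices uniformly across the whole clip,
    # stretching the temporal range relative to a standard 64-frame input.
    idx = np.linspace(0, t - 1, num_frames).astype(int)
    clip = video[idx]
    # Nearest-neighbour spatial downsampling to size x size, which
    # discards fine spatial detail while keeping coarse motion cues.
    ys = np.linspace(0, h - 1, size).astype(int)
    xs = np.linspace(0, w - 1, size).astype(int)
    return clip[:, ys][:, :, xs]

demo = np.zeros((300, 224, 224, 3), dtype=np.uint8)
print(to_motion_clip(demo).shape)  # (128, 112, 112, 3)
```

The resulting tensor has 4x fewer pixels per frame but 2x more frames than a typical I3D RGB input, matching the trade-off the abstract describes: less spatial information, more temporal (motion) information, and no optical flow computation.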