Abstract

Two-stream convolutional neural networks (CNNs) and 3D CNNs are the most popular architectures for recognizing human activities, but each has disadvantages. Combining 3D CNNs with the two-stream design has therefore become a new research hotspot. In this paper, an end-to-end R(2+1)D-based two-stream CNN is proposed for human activity recognition, in which a (2+1)D ResNet is employed in both the spatial and the temporal stream. Specifically, in the temporal stream, PWC-Net is introduced to generate optical-flow images from the RGB frame sequence of a video, and these optical-flow images serve as the input to the temporal (2+1)D ResNet. In the spatial stream, the RGB frame sequence itself is the input to the (2+1)D ResNet. Both stream networks are pre-trained on Kinetics-400 to improve performance, and the prediction scores of the spatial and temporal streams are combined by a fusion method. In addition, the optimal length of the input clip is determined experimentally to further improve recognition accuracy. Experimental results show that, owing to the presented developments, the proposed end-to-end R(2+1)D-based two-stream CNN reaches an accuracy of 94.97% on UCF101.
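The (2+1)D block named in the abstract factorizes each t×d×d 3D convolution into a 1×d×d spatial convolution followed by a t×1×1 temporal convolution. In the standard R(2+1)D design (an assumption here; the abstract does not spell out the formula), the intermediate channel count M is chosen so the factorized block keeps roughly the same parameter budget as the original 3D filter. A minimal sketch of that parameter accounting:

```python
def midplanes(t: int, d: int, n_in: int, n_out: int) -> int:
    """Intermediate channels M for a (2+1)D block, chosen so the
    spatial (1 x d x d) + temporal (t x 1 x 1) pair has roughly the
    same number of parameters as a full t x d x d 3D convolution.
    This matching rule is the standard R(2+1)D choice, assumed here."""
    return (t * d * d * n_in * n_out) // (d * d * n_in + t * n_out)

def params_3d(t: int, d: int, n_in: int, n_out: int) -> int:
    # Parameters of a single t x d x d 3D convolution (bias ignored).
    return t * d * d * n_in * n_out

def params_2plus1d(t: int, d: int, n_in: int, n_out: int) -> int:
    # Parameters of the factorized pair: spatial conv into M channels,
    # then temporal conv from M channels to n_out channels.
    m = midplanes(t, d, n_in, n_out)
    return d * d * n_in * m + t * m * n_out

# Example: a 3x3x3 convolution with 64 input and 64 output channels.
print(midplanes(3, 3, 64, 64))       # intermediate channels M = 144
print(params_3d(3, 3, 64, 64))       # 110592 parameters
print(params_2plus1d(3, 3, 64, 64))  # 110592 parameters (budget matched)
```

The extra nonlinearity inserted between the two factored convolutions (not shown above) is what gives the (2+1)D block more representational power than the single 3D convolution at the same parameter cost.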
