Abstract

Two-stream convolutional neural networks (CNNs) and 3D CNNs are the most popular architectures for recognizing human activities, but each has disadvantages. Combining 3D CNNs with the two-stream design has therefore become a new research hotspot. In this paper, an end-to-end R(2+1)D-based two-stream CNN is proposed for human activity recognition, in which a (2+1)D ResNet is employed in both the spatial and the temporal stream. Specifically, in the temporal stream, PWC-Net is introduced to generate optical-flow images from the RGB frame sequence of a video, and these optical-flow images serve as the input to the temporal (2+1)D ResNet. In the spatial stream, the RGB frame sequence itself is the input to the (2+1)D ResNet. Both stream networks are pre-trained on Kinetics-400 to improve performance, and the prediction scores of the spatial and temporal streams are combined by a fusion method. In addition, the optimal length of the input clip is determined experimentally to further improve recognition accuracy. Experimental results show that, owing to the presented developments, the proposed end-to-end R(2+1)D-based two-stream CNN reaches an accuracy of 94.97% on UCF101.
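The (2+1)D block named in the abstract factorizes each t×d×d 3D convolution into a 1×d×d spatial convolution followed by a t×1×1 temporal convolution. In the standard R(2+1)D design (an assumption here; the abstract does not spell out the formula), the intermediate channel count M is chosen so the factorized block keeps roughly the same parameter budget as the original 3D filter. A minimal sketch of that parameter accounting:

```python
def midplanes(t: int, d: int, n_in: int, n_out: int) -> int:
    """Intermediate channels M for a (2+1)D block, chosen so the
    spatial (1 x d x d) + temporal (t x 1 x 1) pair has roughly the
    same number of parameters as a full t x d x d 3D convolution.
    This matching rule is the standard R(2+1)D choice, assumed here."""
    return (t * d * d * n_in * n_out) // (d * d * n_in + t * n_out)

def params_3d(t: int, d: int, n_in: int, n_out: int) -> int:
    # Parameters of a single t x d x d 3D convolution (bias ignored).
    return t * d * d * n_in * n_out

def params_2plus1d(t: int, d: int, n_in: int, n_out: int) -> int:
    # Parameters of the factorized pair: spatial conv into M channels,
    # then temporal conv from M channels to n_out channels.
    m = midplanes(t, d, n_in, n_out)
    return d * d * n_in * m + t * m * n_out

# Example: a 3x3x3 convolution with 64 input and 64 output channels.
print(midplanes(3, 3, 64, 64))       # intermediate channels M = 144
print(params_3d(3, 3, 64, 64))       # 110592 parameters
print(params_2plus1d(3, 3, 64, 64))  # 110592 parameters (budget matched)
```

The extra nonlinearity inserted between the two factored convolutions (not shown above) is what gives the (2+1)D block more representational power than the single 3D convolution at the same parameter cost.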
