Two-Stream Convolution Neural Network with Video-stream for Action Recognition

Wei Dai, Xinyu Zhang, Ming-Ke Gao, Chen Huang, Yimin Chen

doi:10.1109/ijcnn.2019.8851702

Abstract

As applications of convolutional neural networks in artificial intelligence become increasingly diverse, a growing number of network designs have been proposed, for example 3D convolution and two-stream convolution based on RGB and optical-flow streams. A convolutional neural network with 3D convolutional kernels can extract spatio-temporal features directly from a video sequence for action recognition. However, a 3D convolutional neural network captures only part of the spatio-temporal information, so we propose a new ConvNet architecture called CVDN (Combined Video-stream Deep Network) that extracts more spatio-temporal features from video clips and makes more effective use of the temporal information in the dataset. We evaluate our method on the UCF-101 dataset, where it achieves good results. The method works as follows. First, we initialize our models with ResNet models pre-trained on the Kinetics dataset, then train them on UCF-101 and extract the video-stream features. Next, optical flow images computed from the UCF-101 dataset serve as the input of the optical stream, which extracts the optical features. Finally, the features of the two streams are combined and the result is obtained after a Softmax layer. CVDN obtains good results when the linear fusion ratio of video-stream features to optical-stream features is 5:4, and with ResNet-101 our method reaches an accuracy of 92.2%.
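The linear fusion step described in the abstract can be illustrated with a minimal sketch. It assumes that each stream produces one score per action class, that the 5:4 ratio is applied as a weighted average of those scores, and that Softmax follows the fusion; the abstract does not state at which layer the fusion happens or whether the weights are normalized, so these are assumptions. The function names `softmax`, `fuse_streams`, and `predict` are hypothetical and not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_streams(video_scores, flow_scores, video_weight=5.0, flow_weight=4.0):
    """Linearly fuse per-class scores from two streams, then apply softmax.

    The default weights follow the 5:4 ratio quoted in the abstract.
    Dividing by the weight sum is an assumption made for this sketch.
    """
    if len(video_scores) != len(flow_scores):
        raise ValueError("both streams must score the same number of classes")
    norm = video_weight + flow_weight
    fused = [
        (video_weight * v + flow_weight * f) / norm
        for v, f in zip(video_scores, flow_scores)
    ]
    return softmax(fused)


def predict(video_scores, flow_scores):
    """Return the index of the class with the highest fused probability."""
    probs = fuse_streams(video_scores, flow_scores)
    return max(range(len(probs)), key=probs.__getitem__)
```

For instance, with the made-up scores `[2.0, 1.0, 0.0]` from the video stream and `[0.0, 3.0, 0.0]` from the optical stream, the video stream alone would favor class 0, while the fused scores favor class 1. With real models, the score lists would come from the final layers of the two networks.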

Full Text