Abstract

A video stream is a sequence of static frames that can be described as a 3D signal consisting of spatial and temporal cues. Handling these two cues simultaneously has long been a key problem in video analysis. This work proposes a two-stage neural network for video event classification. Instead of directly connecting an RNN to CNNs, a two-stage strategy is employed: the first stage transfers pre-learned object knowledge to video content through selected anchors in a supervised manner. Under this strategy, the frame sequence is reduced to anchor points by mean-max pooling and then classified by the transferred CNNs. The second stage incorporates temporal information by exploiting the RNN's 'deep in time' capability. The transferred CNNs joined with the RNN handle spatial and temporal information at the same time, and the whole model is trained end to end except that the transferred CNNs' parameters are kept fixed. In particular, LSTM and GRU units with one or two layers are adopted to mitigate the vanishing and exploding gradient problems. Experiments on three in-the-wild datasets show that the proposed two-stage network delivers performance comparable to other state-of-the-art approaches, demonstrating its effectiveness for video event classification.
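The abstract only outlines the architecture, so the following is a minimal sketch of how the two stages could fit together, assuming PyTorch, a torchvision ResNet-18 as the transferred (frozen) CNN, and mean-max pooling applied over the frames of each anchor; the class name, hyper-parameters, and anchor-selection step are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the paper's code): frozen transferred CNN ->
# mean-max pooling per anchor -> LSTM/GRU -> event classifier.
import torch
import torch.nn as nn
from torchvision import models


class TwoStageEventClassifier(nn.Module):
    def __init__(self, num_classes, hidden_size=256, num_layers=2, cell="lstm"):
        super().__init__()
        # Stage 1: transferred CNN with parameters kept fixed.
        backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop the FC head
        for p in self.cnn.parameters():
            p.requires_grad = False
        feat_dim = backbone.fc.in_features  # 512 for ResNet-18

        # Stage 2: RNN ('deep in time') over the per-anchor features;
        # one or two layers of LSTM or GRU, as in the abstract.
        rnn_cls = nn.LSTM if cell == "lstm" else nn.GRU
        self.rnn = rnn_cls(feat_dim * 2, hidden_size, num_layers, batch_first=True)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, clips):
        # clips: (batch, num_anchors, frames_per_anchor, 3, H, W)
        b, a, f, c, h, w = clips.shape
        feats = self.cnn(clips.view(b * a * f, c, h, w)).view(b, a, f, -1)
        # Mean-max pooling over the frames belonging to each anchor.
        anchor_feats = torch.cat([feats.mean(dim=2), feats.max(dim=2).values], dim=-1)
        out, _ = self.rnn(anchor_feats)      # (b, num_anchors, hidden_size)
        return self.classifier(out[:, -1])   # classify from the last time step


if __name__ == "__main__":
    model = TwoStageEventClassifier(num_classes=10)
    dummy = torch.randn(2, 4, 3, 3, 112, 112)  # 2 videos, 4 anchors, 3 frames each
    print(model(dummy).shape)                  # torch.Size([2, 10])
```

Because the CNN parameters are frozen, only the RNN and the final linear layer receive gradients, which matches the abstract's description of end-to-end training with the transferred CNNs' parameters unchanged.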
