Abstract

A video stream is a sequence of static frames that can be described as a 3D signal consisting of spatial and temporal cues. Handling these two cues simultaneously has long been a key problem in video analysis. This work proposes a two-stage neural network for video event classification. Instead of directly connecting an RNN to CNNs, a two-stage strategy is employed: the first stage transfers pre-learned object knowledge to video content through selected anchors in a supervised manner. Under this strategy, the frame sequence is reduced to anchor points by mean-max pooling and then classified by the transferred CNNs. The second stage incorporates temporal information by exploiting the RNN's 'deep in time' capability. The transferred CNNs joined with the RNN handle spatial and temporal information at the same time, and the whole model is trained end to end except that the transferred CNNs' parameters are kept fixed. In particular, LSTM and GRU units with one or two layers are adopted to mitigate the vanishing and exploding gradient problems. Experiments on three in-the-wild datasets show that the proposed two-stage network delivers performance comparable to other state-of-the-art approaches, demonstrating its effectiveness for video event classification.
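The abstract only outlines the architecture, so the following is a minimal sketch of how the two stages could fit together, assuming PyTorch, a torchvision ResNet-18 as the transferred (frozen) CNN, and mean-max pooling applied over the frames of each anchor; the class name, hyper-parameters, and anchor-selection step are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the paper's code): frozen transferred CNN ->
# mean-max pooling per anchor -> LSTM/GRU -> event classifier.
import torch
import torch.nn as nn
from torchvision import models


class TwoStageEventClassifier(nn.Module):
    def __init__(self, num_classes, hidden_size=256, num_layers=2, cell="lstm"):
        super().__init__()
        # Stage 1: transferred CNN with parameters kept fixed.
        backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop the FC head
        for p in self.cnn.parameters():
            p.requires_grad = False
        feat_dim = backbone.fc.in_features  # 512 for ResNet-18

        # Stage 2: RNN ('deep in time') over the per-anchor features;
        # one or two layers of LSTM or GRU, as in the abstract.
        rnn_cls = nn.LSTM if cell == "lstm" else nn.GRU
        self.rnn = rnn_cls(feat_dim * 2, hidden_size, num_layers, batch_first=True)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, clips):
        # clips: (batch, num_anchors, frames_per_anchor, 3, H, W)
        b, a, f, c, h, w = clips.shape
        feats = self.cnn(clips.view(b * a * f, c, h, w)).view(b, a, f, -1)
        # Mean-max pooling over the frames belonging to each anchor.
        anchor_feats = torch.cat([feats.mean(dim=2), feats.max(dim=2).values], dim=-1)
        out, _ = self.rnn(anchor_feats)      # (b, num_anchors, hidden_size)
        return self.classifier(out[:, -1])   # classify from the last time step


if __name__ == "__main__":
    model = TwoStageEventClassifier(num_classes=10)
    dummy = torch.randn(2, 4, 3, 3, 112, 112)  # 2 videos, 4 anchors, 3 frames each
    print(model(dummy).shape)                  # torch.Size([2, 10])
```

Because the CNN parameters are frozen, only the RNN and the final linear layer receive gradients, which matches the abstract's description of end-to-end training with the transferred CNNs' parameters unchanged.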
