Abstract

Learning spatiotemporal features with deep neural networks has long been a challenging task in computer vision. In this paper, we present a novel deep architecture, termed the Bifurcated Convolutional Neural Network (BifurcatedNet), to learn discriminative video representations in an end-to-end manner. The BifurcatedNet is built by stacking bifurcated blocks that aim to simultaneously capture static appearance information and temporal dynamics from the input data. Specifically, each bifurcated block is composed of two separate branches: an appearance branch and a dynamic branch. The appearance branch employs 2D convolutions to obtain the spatial responses of image pixels or filters for each input frame, while the dynamic branch is built on spatio-temporal convolutions that exploit the temporal dynamics of pixels and filter responses across multiple frames. Extensive experiments are conducted on two popular action recognition benchmarks: UCF101 and HMDB51. With only RGB input, the BifurcatedNet achieves superior performance over existing state-of-the-art models under the same experimental setting. The proposed BifurcatedNet is also implemented in a two-stream fashion using both RGB and optical flow inputs, and again achieves state-of-the-art performance, demonstrating the effectiveness of the network design. Furthermore, to evaluate its generalization ability, we conduct experiments on the ChaLearn LAP IsoGD dataset and find that our model also performs well on gesture recognition tasks.
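
To make the two-branch design concrete, the following is a minimal sketch of one bifurcated block in PyTorch. The channel sizes, kernel shapes (a 1x3x3 kernel realizing the per-frame 2D appearance branch, a 3x3x3 kernel for the spatio-temporal dynamic branch), and fusion by element-wise summation are illustrative assumptions, not the paper's exact specification.

```python
import torch
import torch.nn as nn


class BifurcatedBlock(nn.Module):
    """Hypothetical sketch of a bifurcated block: an appearance branch
    (2D convolution applied to each frame, expressed as a 3D convolution
    with temporal kernel size 1) and a dynamic branch (spatio-temporal
    3D convolution across frames). Details are assumptions."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        # Appearance branch: spatial-only convolution over each frame.
        self.appearance = nn.Conv3d(
            in_channels, out_channels,
            kernel_size=(1, 3, 3), padding=(0, 1, 1))
        # Dynamic branch: spatio-temporal convolution across frames.
        self.dynamic = nn.Conv3d(
            in_channels, out_channels,
            kernel_size=(3, 3, 3), padding=(1, 1, 1))
        self.bn = nn.BatchNorm3d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, height, width)
        # Fuse the two branches by summation (an assumed fusion choice).
        return self.relu(self.bn(self.appearance(x) + self.dynamic(x)))


if __name__ == "__main__":
    clip = torch.randn(2, 3, 8, 112, 112)  # 8 RGB frames at 112x112
    block = BifurcatedBlock(3, 64)
    print(block(clip).shape)  # torch.Size([2, 64, 8, 112, 112])
```

Stacking such blocks yields a network in which every stage sees both a per-frame spatial view and a cross-frame temporal view of its input, which is the intuition behind the bifurcated design.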
