Abstract

With the emergence of a large number of video resources, video action recognition is attracting much attention. Recently, motivated by the outstanding performance of three-dimensional (3D) convolutional neural networks (CNNs), many works have begun to apply them to action recognition and have obtained satisfactory results. However, high computational overhead greatly reduces the efficiency of 3D CNNs. To address this shortcoming, in this paper we first propose two lightweight innovations: the Xwise Separable Convolution and the SS block. We then build an efficient 3D CNN, called XwiseNet, based on these innovations. Our work aims to make 3D CNNs lightweight without reducing recognition accuracy. The key idea of the Xwise Separable Convolution is to fully decouple the 3D convolution along the channel, spatial, and temporal dimensions. The SS block captures long-range temporal dependencies by aggregating sequence-specific global context into each sequence feature. Experiments verify that XwiseNet achieves competitive performance with the lowest computational overhead.
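
To give a rough sense of why decoupling a 3D convolution along the channel, spatial, and temporal dimensions saves computation, the sketch below compares weight counts for a standard k x k x k convolution and one possible fully decoupled factorization. The factorization used here (a 1x1x1 pointwise convolution, followed by a depthwise 1 x k x k spatial convolution and a depthwise k x 1 x 1 temporal convolution) is an assumption made for illustration; the abstract does not specify the exact layout of the Xwise Separable Convolution, and the function names are hypothetical.

```python
def conv3d_params(c_in, c_out, kt, kh, kw, groups=1):
    """Weight count of a 3D convolution with the given kernel (bias ignored)."""
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * c_out * kt * kh * kw


def standard_params(c_in, c_out, k=3):
    """Standard 3D convolution: channels, space, and time are mixed jointly."""
    return conv3d_params(c_in, c_out, k, k, k)


def decoupled_params(c_in, c_out, k=3):
    """Assumed factorization (illustrative only, not the paper's definition)."""
    # Channel mixing only: 1x1x1 pointwise convolution.
    pointwise = conv3d_params(c_in, c_out, 1, 1, 1)
    # Spatial filtering only: depthwise 1 x k x k convolution.
    spatial = conv3d_params(c_out, c_out, 1, k, k, groups=c_out)
    # Temporal filtering only: depthwise k x 1 x 1 convolution.
    temporal = conv3d_params(c_out, c_out, k, 1, 1, groups=c_out)
    return pointwise + spatial + temporal


if __name__ == "__main__":
    full = standard_params(64, 64)       # 64 * 64 * 27 weights
    light = decoupled_params(64, 64)     # 4096 + 576 + 192 weights
    print(full, light, round(full / light, 1))
```

Under these assumptions, a 64-to-64-channel layer with k = 3 needs 110,592 weights in the standard form and 4,864 in the decoupled form. These figures describe only this illustrative factorization and say nothing about the accuracy or measured cost of XwiseNet itself.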
