Abstract

The video action recognition task involves modeling spatiotemporal information, and efficiency is critical for capturing spatiotemporal dependencies in video. Most existing models rely on optical flow to capture the dynamic visual tempos between consecutive video frames. Although impressive performance can be achieved by combining optical flow with RGB, the time-consuming nature of optical flow computation cannot be ignored. Moreover, while 3D CNNs successfully model spatiotemporal information, their enormous computational volume is unsuitable for real-time action recognition. In this letter, we propose a novel lightweight video feature extraction strategy that achieves better recognition performance with lower FLOPs. In particular, we perform convolution on the video cube from three orthogonal views to learn its appearance and motion features. Compared with the computational volume of 3D CNNs, our proposed method is more economical and thus meets lightweight requirements. Extensive experimental results on the public Something-Something V1 &amp; V2 and Diving48 datasets show that our approach achieves state-of-the-art performance.
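The three-orthogonal-view idea can be sketched as follows. This is a minimal NumPy illustration under assumed shapes and kernels, not the authors' implementation: a video cube of shape (T, H, W) is convolved over its H-W planes (per-frame appearance) and over its T-W and T-H planes (motion along the temporal axis), yielding three complementary feature volumes at far lower cost than a full 3D convolution.

```python
import numpy as np

def conv2d_plane(x, kernel):
    """Valid 2D convolution of a 2D array with a small kernel (naive loop)."""
    kh, kw = kernel.shape
    H, W = x.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

def three_view_features(video, kernel):
    """Convolve a video cube (T, H, W) along three orthogonal planes.

    - H-W plane (each frame):   appearance features
    - T-W plane (each row):     motion along the width axis
    - T-H plane (each column):  motion along the height axis
    """
    T, H, W = video.shape
    hw = np.stack([conv2d_plane(video[t], kernel) for t in range(T)])
    tw = np.stack([conv2d_plane(video[:, h, :], kernel) for h in range(H)])
    th = np.stack([conv2d_plane(video[:, :, w], kernel) for w in range(W)])
    return hw, tw, th

# Hypothetical input: 8 frames of 16x16, with a simple 3x3 averaging kernel.
video = np.random.rand(8, 16, 16)
kernel = np.ones((3, 3)) / 9.0
hw, tw, th = three_view_features(video, kernel)
print(hw.shape, tw.shape, th.shape)  # (8, 14, 14) (16, 6, 14) (16, 6, 14)
```

Each of the three passes is a stack of 2D convolutions, so the total work grows with three k² kernels rather than one k³ kernel, which is the source of the FLOPs saving relative to 3D convolution.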
