Abstract

Efficient action recognition aims to classify a video clip into a specific action category at low computational cost. This is challenging because integrated spatio-temporal computation (e.g., 3D convolution) involves intensive operations and increases model complexity. This paper explores integrating channel splitting with filter decoupling for efficient architecture design and feature refinement by proposing a novel spatio-temporal collaborative (STC) module. STC splits the video feature channels into two groups and separately learns spatio-temporal representations in parallel with decoupled convolutional operators. In particular, STC consists of two computation-efficient blocks, i.e., S·T and T·S, which extract either spatial (S·) or temporal (T·) features and further refine them with either temporal (·T) or spatial (·S) contexts globally. The spatial/temporal context refers to information dynamics aggregated along the temporal/spatial axis. To thoroughly examine our method on video action recognition tasks, we conduct extensive experiments on five video benchmark datasets that require temporal reasoning. Experimental results show that the proposed STC networks achieve a competitive trade-off between model efficiency and effectiveness.
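
To make the described mechanism concrete, the following is a minimal PyTorch-style sketch of a channel-split module with two decoupled branches. The kernel sizes, the sigmoid gating, and the pooling-based context aggregation are illustrative assumptions, not the authors' exact design; only the overall structure (channel splitting, an S·T branch, and a T·S branch) follows the abstract.

```python
import torch
import torch.nn as nn

class STCSketch(nn.Module):
    """Illustrative sketch of an STC-like module (assumed details).

    Splits channels into two groups: one learns spatial features refined
    by a global temporal context, the other learns temporal features
    refined by a global spatial context. Assumes an even channel count.
    """

    def __init__(self, channels: int):
        super().__init__()
        half = channels // 2
        # S·T branch: spatial 1x3x3 convolution ...
        self.spatial_conv = nn.Conv3d(half, half, kernel_size=(1, 3, 3),
                                      padding=(0, 1, 1))
        # ... refined by temporal context: aggregate over the spatial axis,
        # then model dynamics along time (gating is an assumed choice).
        self.temporal_context = nn.Sequential(
            nn.AdaptiveAvgPool3d((None, 1, 1)),
            nn.Conv3d(half, half, kernel_size=(3, 1, 1), padding=(1, 0, 0)),
            nn.Sigmoid(),
        )
        # T·S branch: temporal 3x1x1 convolution ...
        self.temporal_conv = nn.Conv3d(half, half, kernel_size=(3, 1, 1),
                                       padding=(1, 0, 0))
        # ... refined by spatial context: aggregate over the temporal axis,
        # then model layout over space.
        self.spatial_context = nn.Sequential(
            nn.AdaptiveAvgPool3d((1, None, None)),
            nn.Conv3d(half, half, kernel_size=(1, 3, 3), padding=(0, 1, 1)),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (N, C, T, H, W); split channels into two groups.
        xs, xt = x.chunk(2, dim=1)
        # S·T: extract spatial features, gate them with temporal context.
        s = self.spatial_conv(xs)
        s = s * self.temporal_context(s)
        # T·S: extract temporal features, gate them with spatial context.
        t = self.temporal_conv(xt)
        t = t * self.spatial_context(t)
        # Recombine the two groups along the channel axis.
        return torch.cat([s, t], dim=1)


# Usage example: a batch of 2 clips, 64 channels, 8 frames, 56x56 spatial.
module = STCSketch(64)
y = module(torch.randn(2, 64, 8, 56, 56))  # -> (2, 64, 8, 56, 56)
```

Because each branch operates on half the channels with a factorized (spatial-only or temporal-only) kernel, this layout costs far fewer operations than a full 3D convolution over all channels, which is the efficiency argument the abstract makes.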
