Abstract

Dynamic contextual information along the temporal dimension is key to fine-grained action recognition. Conventional 2D CNNs cannot capture temporal contextual relationships; 3D CNNs model local temporal structure well, but they are computationally intensive and lack the capability to model global temporal context. This article proposes a parallel cross-time temporal module (CTM) that aims to efficiently capture dynamic contextual information at both local and global temporal scales. Our study suggests that, equipped with the CTM, 2D CNNs can better mine temporal features and enrich temporal contextual relationships. The CTM can be embedded into any existing 2D CNN baseline in a plug-and-play manner, yielding a framework capable of complex spatio-temporal modeling (CTNet) at a tiny additional computational cost. In extensive validation experiments on three datasets (Something-Something V1 & V2, Jester, and Diving48), embedding the CTM into 2D CNN frameworks improves both action recognition accuracy and inference speed, clearly outperforming existing temporal-context baseline optimization schemes of similar computational complexity.
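The abstract describes the CTM as two parallel temporal branches (local and global) added residually onto 2D CNN features. The paper's actual architecture is not given here, so the following is only a minimal NumPy sketch of that general idea, with placeholder (non-learned) weights: a depthwise temporal convolution stands in for the local branch, and a softmax-weighted temporal pooling stands in for the global branch; the function name and all internals are hypothetical.

```python
import numpy as np

def cross_time_module(x, local_k=3):
    """Hypothetical sketch of a parallel cross-time temporal module.

    x: per-frame channel descriptors of shape (T, C), e.g. globally
    pooled 2D CNN features for T frames. A local branch (short-range
    temporal convolution) and a global branch (attention-style pooling
    over all frames) run in parallel; their outputs are fused and added
    residually, so the module can be dropped into an existing backbone.
    """
    T, C = x.shape
    # Local branch: per-channel temporal convolution, zero-padded so
    # the temporal length T is preserved (placeholder averaging kernel
    # standing in for learned weights).
    pad = local_k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    kernel = np.ones(local_k) / local_k
    local = np.stack(
        [np.convolve(xp[:, c], kernel, mode="valid") for c in range(C)],
        axis=1,
    )
    # Global branch: softmax attention over the full temporal extent,
    # producing one context vector per channel.
    scores = x - x.max(axis=0, keepdims=True)  # numerically stable logits
    attn = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)
    global_ctx = (attn * x).sum(axis=0, keepdims=True)  # shape (1, C)
    # Parallel fusion plus residual connection: output keeps shape (T, C).
    return x + local + global_ctx
```

The residual formulation is what makes a module like this plug-and-play: if both branches output zero, the backbone's original features pass through unchanged.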
