Abstract

In recent years, sequential action recognition has attracted increasing attention, as it requires long-term sequential and compositional reasoning about human actions and object interactions. Existing methods perform this reasoning either on snippets covering very short runs of consecutive frames or on key frames sampled from segments, which biases processing toward either local or global temporal information. We also find that ad-hoc training and ensembling of two separate networks using existing sampling strategies can easily outperform complex state-of-the-art methods, which reveals the complementary nature of current sampling strategies. Motivated by this observation, we propose a simple yet efficient strategy named Dense Segmental Sampling (DSS) and a novel network architecture named Temporal Dense Segment Network (TDSN) to capture the complementary information exposed by DSS. TDSN achieves excellent results on benchmark action recognition datasets, which not only validates the proposed strategy but also highlights the importance of this direction for sequential video reasoning.
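The two existing sampling regimes the abstract contrasts, and a combined variant in the spirit of DSS, can be sketched as follows. This is a minimal illustration, not the paper's actual method: the function names and the parameters `snippet_len` and `num_segments` are hypothetical, and the combined strategy simply places a short consecutive snippet inside each segment.

```python
import random

def snippet_sampling(num_frames, snippet_len=8):
    # Dense local sampling: one run of consecutive frames from the video.
    # snippet_len is an illustrative parameter, not taken from the paper.
    start = random.randrange(max(1, num_frames - snippet_len + 1))
    return list(range(start, min(start + snippet_len, num_frames)))

def segment_sampling(num_frames, num_segments=8):
    # Sparse global sampling: one key frame drawn from each of
    # num_segments equal-length segments spanning the whole video.
    seg_len = num_frames / num_segments
    return [int(seg_len * i + random.random() * seg_len)
            for i in range(num_segments)]

def dense_segmental_sampling(num_frames, num_segments=4, snippet_len=4):
    # Hypothetical combination of both regimes: a short consecutive
    # snippet *within* each segment, so the sampled frames carry both
    # local motion detail and global temporal coverage.
    seg_len = num_frames // num_segments
    indices = []
    for i in range(num_segments):
        start = i * seg_len + random.randrange(max(1, seg_len - snippet_len + 1))
        indices.extend(range(start, min(start + snippet_len, num_frames)))
    return indices
```

For a 100-frame video, `snippet_sampling` returns 8 consecutive indices from one location, `segment_sampling` returns 8 indices spread across the whole video, and `dense_segmental_sampling` returns 4 short runs of 4 consecutive indices, one per quarter of the video.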
