Abstract

In recent years, sequential action recognition has attracted increasing attention, as it requires long-term sequential and compositional reasoning about human actions and object interactions. Existing methods perform this reasoning either on snippets covering very short runs of consecutive frames or on key frames sampled from segments, which biases processing toward either local or global temporal information. We also find that ad-hoc training and ensembling of two separate networks using existing sampling strategies can easily outperform complex state-of-the-art methods, which reveals the complementary nature of current sampling strategies. Motivated by this observation, we propose a simple yet efficient strategy named Dense Segmental Sampling (DSS) and a novel network architecture named Temporal Dense Segment Network (TDSN) to capture the complementary information exposed by DSS. TDSN achieves excellent results on benchmark action recognition datasets, which not only validates the proposed strategy but also highlights the importance of this direction for sequential video reasoning.
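The two existing sampling regimes the abstract contrasts, and a combined variant in the spirit of DSS, can be sketched as follows. This is a minimal illustration, not the paper's actual method: the function names and the parameters `snippet_len` and `num_segments` are hypothetical, and the combined strategy simply places a short consecutive snippet inside each segment.

```python
import random

def snippet_sampling(num_frames, snippet_len=8):
    # Dense local sampling: one run of consecutive frames from the video.
    # snippet_len is an illustrative parameter, not taken from the paper.
    start = random.randrange(max(1, num_frames - snippet_len + 1))
    return list(range(start, min(start + snippet_len, num_frames)))

def segment_sampling(num_frames, num_segments=8):
    # Sparse global sampling: one key frame drawn from each of
    # num_segments equal-length segments spanning the whole video.
    seg_len = num_frames / num_segments
    return [int(seg_len * i + random.random() * seg_len)
            for i in range(num_segments)]

def dense_segmental_sampling(num_frames, num_segments=4, snippet_len=4):
    # Hypothetical combination of both regimes: a short consecutive
    # snippet *within* each segment, so the sampled frames carry both
    # local motion detail and global temporal coverage.
    seg_len = num_frames // num_segments
    indices = []
    for i in range(num_segments):
        start = i * seg_len + random.randrange(max(1, seg_len - snippet_len + 1))
        indices.extend(range(start, min(start + snippet_len, num_frames)))
    return indices
```

For a 100-frame video, `snippet_sampling` returns 8 consecutive indices from one location, `segment_sampling` returns 8 indices spread across the whole video, and `dense_segmental_sampling` returns 4 short runs of 4 consecutive indices, one per quarter of the video.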
