Abstract

In recent years, sequential action recognition has attracted increasing attention because it requires long-term sequential and compositional reasoning about human actions and object interactions. Existing methods perform reasoning using either snippets that cover a few consecutive frames or key frames sampled from segments, which results in a biased treatment of local and global temporal information. We also find that ad hoc training and ensembling of two separate networks, each using one of the existing sampling strategies, can easily outperform complex state-of-the-art methods, which reveals the complementary nature of current sampling strategies. Motivated by this observation, we propose a simple yet efficient strategy named Dense Segmental Sampling (DSS) and a novel network architecture named Temporal Dense Segment Network (TDSN) to capture the complementary information from DSS. Our TDSN achieves excellent results on benchmark action recognition datasets, which not only validate the proposed strategy but also highlight the importance of this direction for sequential video reasoning.
