Abstract
This paper proposes a novel pretext task for learning skeleton-based video representations for self-supervised action recognition. Instead of exploiting only the whole body, various levels of the skeleton structure (e.g., upper body, lower body, left arm, left leg, right arm, right leg, and torso) are employed to extract essential coarse-grained characteristics. This involves computing statistical representations such as motion, orientation, trajectory, and magnitude shift from unlabeled skeleton configurations. A learning model is then built and trained to predict these statistical representations given the sequence configuration as input. Our approach is question-driven: each question acts as a puzzle piece contributing to a deeper understanding of the skeleton joint configuration. It is inspired by the human cognitive system's ability to hypothesize unseen actions by posing pertinent questions and envisioning plausible scenarios to recognize the actions taking place. The answers to these devised questions are derived from the statistical representation of skeleton configurations. To this end, we devised 44 questions spanning from the broadest overview to the finest detail. Our experiments on the NTU RGB+D, NW-UCLA, and PKU-MMD datasets demonstrate outstanding results in action recognition, proving the superiority of our approach in learning discriminative characteristics.
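To make the part-level statistics concrete, the following is a minimal sketch of one such representation: the average per-frame displacement magnitude ("magnitude shift") of a body part's joints over a skeleton sequence. The joint layout, part partition, and function name are illustrative assumptions, not the paper's actual formulation; datasets such as NTU RGB+D use richer skeletons (25 joints per body).

```python
import math

# Hypothetical part partition over a simplified 7-joint skeleton
# (an assumption for illustration; real datasets define many more joints).
BODY_PARTS = {
    "left_arm": [0, 1],
    "right_arm": [2, 3],
    "torso": [4],
    "left_leg": [5],
    "right_leg": [6],
}

def motion_magnitude(frames, part_joints):
    """Average per-frame displacement magnitude of the given joints.

    frames: list of frames; each frame is a list of (x, y, z) joint tuples.
    Returns one rough motion statistic for a single body part.
    """
    if len(frames) < 2:
        return 0.0
    total = 0.0
    for prev, curr in zip(frames, frames[1:]):
        for j in part_joints:
            dx, dy, dz = (curr[j][k] - prev[j][k] for k in range(3))
            total += math.sqrt(dx * dx + dy * dy + dz * dz)
    # Normalize by the number of transitions and joints in the part.
    return total / ((len(frames) - 1) * len(part_joints))

# Toy sequence: only the left-arm joints translate along x; the rest are static.
frames = [
    [(t * 0.1, 0.0, 0.0) if j < 2 else (0.0, 0.0, 0.0) for j in range(7)]
    for t in range(4)
]
stats = {part: motion_magnitude(frames, idx) for part, idx in BODY_PARTS.items()}
# stats["left_arm"] is ~0.1 per frame; all other parts report 0.0
```

Analogous statistics for orientation or trajectory would be computed per part in the same fashion, and a model would be trained to regress them from the raw joint sequence as the self-supervised objective.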