Abstract

Current state-of-the-art action classification methods aggregate space-time features globally, from the entire video clip under consideration. However, the extracted features may in part be due to irrelevant scene context, or to movements shared amongst multiple action classes. This motivates learning with local discriminative parts, which can help localise which portions of the video are significant. Exploiting spatio-temporal structure in the video should also improve results, just as deformable part models have proven highly successful in object recognition. However, whereas objects have clear boundaries, making it easy to define ground truth for initialisation, 3D space-time actions are inherently ambiguous and expensive to annotate in large datasets. It is therefore desirable to adapt pictorial star models to action datasets without location annotation, and to features invariant to changes in pose, such as bag-of-features and Fisher vectors, rather than low-level HoG. Thus, we propose local deformable spatial bag-of-features, in which local discriminative regions are split into a fixed grid of parts that are allowed to deform in both space and time at test time. In our experimental evaluation we demonstrate that, by using local space-time action parts in a weakly supervised setting, we are able to achieve state-of-the-art classification performance whilst localising actions even in the most challenging video datasets.
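
To make the part-based scoring concrete, below is a minimal sketch of a star-structured deformable part score over a space-time grid of bag-of-features histograms. It assumes each part is scored by a linear filter (e.g. learned SVM weights) applied to the histogram of the grid cell it lands in, minus a quadratic penalty on its space-time displacement from an anchor cell; the function name `deformable_score`, the per-axis quadratic penalty, and the array shapes are illustrative assumptions for exposition, not the paper's exact formulation.

```python
import numpy as np

def deformable_score(hists, anchors, part_filters, defo_costs, radius=1):
    """Score a candidate region with a star-structured deformable part model.

    hists        : (X, Y, T, K) array; hists[x, y, t] is the K-dim
                   bag-of-features histogram of the sub-volume at cell (x, y, t).
    anchors      : (P, 3) int array giving each part's default grid cell.
    part_filters : (P, K) array of linear part filters (e.g. SVM weights).
    defo_costs   : (P, 3) array of per-axis quadratic deformation penalties.
    radius       : maximum displacement (in cells) along each axis.
    """
    X, Y, T, _ = hists.shape
    total = 0.0
    placements = []
    for p, (ax, ay, at) in enumerate(anchors):
        best, best_cell = -np.inf, (ax, ay, at)
        # Search space-time displacements of this part around its anchor.
        for dx in range(-radius, radius + 1):
            for dy in range(-radius, radius + 1):
                for dt in range(-radius, radius + 1):
                    x, y, t = ax + dx, ay + dy, at + dt
                    if not (0 <= x < X and 0 <= y < Y and 0 <= t < T):
                        continue
                    # Appearance score minus quadratic deformation cost.
                    app = part_filters[p] @ hists[x, y, t]
                    pen = defo_costs[p] @ np.array([dx * dx, dy * dy, dt * dt])
                    if app - pen > best:
                        best, best_cell = app - pen, (x, y, t)
        total += best
        placements.append(best_cell)
    return total, placements

# Hypothetical usage with random data: a 4x4 spatial grid over 6 time
# steps, a 64-word codebook, and two parts anchored on the fixed grid.
rng = np.random.default_rng(0)
hists = rng.random((4, 4, 6, 64))
anchors = np.array([[1, 1, 2], [2, 2, 3]])
filters = rng.standard_normal((2, 64))
costs = np.full((2, 3), 0.05)
score, placed = deformable_score(hists, anchors, filters, costs)
```

The returned `placements` illustrate why such a model supports weakly supervised localisation: the best-scoring part positions indicate which space-time cells of the clip drove the classification decision.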
