Abstract
This paper strives for spatio-temporal localization of human actions in videos. In the literature, the consensus is to achieve localization by training on bounding box annotations provided for each frame of each training video. As annotating boxes in video is expensive, cumbersome, and error-prone, we propose to bypass box-supervision. Instead, we introduce action localization based on point-supervision. We start from unsupervised spatio-temporal proposals, which provide a set of candidate regions in videos. While normally used exclusively for inference, we show that spatio-temporal proposals can also be leveraged during training when guided by a sparse set of point annotations. We introduce an overlap measure between points and spatio-temporal proposals, and incorporate points and proposals into a new multiple instance learning objective. During inference, we introduce pseudo-points, visual cues extracted automatically from videos that guide the selection of spatio-temporal proposals. We outline five spatial pseudo-points and one temporal pseudo-point, as well as a measure to best leverage pseudo-points at test time. Experimental evaluation on three action localization datasets shows our pointly-supervised approach (1) is as effective as traditional box-supervision at a fraction of the annotation cost, (2) is robust to sparse and noisy point annotations, (3) benefits from pseudo-points during inference, and (4) outperforms recent weakly-supervised alternatives. This leads us to conclude that points provide a viable alternative to boxes for action localization.
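To make the point-to-proposal matching concrete, below is a minimal Python sketch that scores a spatio-temporal proposal by the fraction of annotated points it covers. The tube layout (a dict of per-frame boxes), the function names, and the exact scoring are illustrative assumptions, not the paper's implementation of the overlap measure.

    def point_proposal_overlap(points, proposal):
        """Fraction of point annotations covered by a spatio-temporal proposal.

        points:   iterable of (frame, x, y) click annotations.
        proposal: dict mapping frame index -> (x1, y1, x2, y2) box.
        Returns a score in [0, 1]. Illustrative sketch only.
        """
        points = list(points)
        if not points:
            return 0.0
        hits = 0
        for frame, x, y in points:
            box = proposal.get(frame)
            if box is None:
                continue  # the proposal does not span this frame
            x1, y1, x2, y2 = box
            if x1 <= x <= x2 and y1 <= y <= y2:
                hits += 1
        return hits / len(points)

    # During training, the highest-scoring proposal per video can then serve
    # as the positive instance in a multiple instance learning objective:
    # best = max(proposals, key=lambda p: point_proposal_overlap(points, p))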
Highlights
This paper aims to recognize and localize actions such as skiing, running, and getting out of a vehicle in videos.
A common limitation is the quality of the spatio-temporal proposals themselves: only a few proposals have a high overlap with the ground truth, making localization a needle-in-a-haystack problem regardless of the model (see the overlap sketch after this list).
We evaluate the influence of the spatio-temporal proposals upon which our approach is built.
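To make the proposal-quality criterion concrete, the following sketch computes the standard spatio-temporal IoU between a proposal tube and a ground-truth tube: per-frame box IoU, averaged over the union of frames spanned by either tube. The dict-based tube layout is an assumption for illustration, not the paper's data structure.

    def tube_iou(tube_a, tube_b):
        """Spatio-temporal IoU between two tubes of per-frame boxes.

        tube_a, tube_b: dicts mapping frame index -> (x1, y1, x2, y2).
        Frames covered by only one tube contribute an IoU of zero.
        """
        frames = set(tube_a) | set(tube_b)
        if not frames:
            return 0.0
        total = 0.0
        for f in frames:
            a, b = tube_a.get(f), tube_b.get(f)
            if a is None or b is None:
                continue  # frame missing from one tube counts as zero overlap
            ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
            ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
            inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
            union = ((a[2] - a[0]) * (a[3] - a[1])
                     + (b[2] - b[0]) * (b[3] - b[1]) - inter)
            total += inter / union if union > 0 else 0.0
        return total / len(frames)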
Summary
This paper aims to recognize and localize actions such as skiing, running, and getting out of a vehicle in videos. Action recognition has been a vibrant topic in vision for several decades, resulting in approaches based on local spatio-temporal features (Dollár et al 2005; Laptev 2005; Wang et al 2009), dense trajectories (Jain et al 2013; Wang et al 2013), two-stream neural networks (Simonyan and Zisserman 2014; Feichtenhofer et al 2016), 3D convolutions (Ji et al 2013; Tran et al 2015), and recurrent networks (Donahue et al 2015; Li et al 2018; Srivastava et al 2015). We aim to recognize which actions occur in videos, and to discover when and where the actions are present.