Abstract

This paper strives for spatio-temporal localization of human actions in videos. In the literature, the consensus is to achieve localization by training on bounding box annotations provided for each frame of each training video. As annotating boxes in video is expensive, cumbersome and error-prone, we propose to bypass box-supervision. Instead, we introduce action localization based on point-supervision. We start from unsupervised spatio-temporal proposals, which provide a set of candidate regions in videos. While normally used exclusively for inference, we show spatio-temporal proposals can also be leveraged during training when guided by a sparse set of point annotations. We introduce an overlap measure between points and spatio-temporal proposals and incorporate them all into a new objective of a multiple instance learning optimization. During inference, we introduce pseudo-points, visual cues from videos, that automatically guide the selection of spatio-temporal proposals. We outline five spatial and one temporal pseudo-point, as well as a measure to best leverage pseudo-points at test time. Experimental evaluation on three action localization datasets shows our pointly-supervised approach (1) is as effective as traditional box-supervision at a fraction of the annotation cost, (2) is robust to sparse and noisy point annotations, (3) benefits from pseudo-points during inference, and (4) outperforms recent weakly-supervised alternatives. This leads us to conclude that points provide a viable alternative to boxes for action localization.
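
To make the overlap measure between point annotations and spatio-temporal proposals concrete, the sketch below gives one plausible instantiation. It is an illustrative reading under stated assumptions, not the paper's exact formulation: a proposal is assumed to be stored as one bounding box per frame, and the score is the fraction of annotated points that fall inside the proposal's box on their frame. The function name and data layout are hypothetical.

```python
def point_proposal_overlap(points, proposal):
    """Toy overlap between sparse point annotations and one
    spatio-temporal proposal (a tube of per-frame boxes).

    points:   {frame_idx: (x, y)} sparse point annotations
    proposal: {frame_idx: (x0, y0, x1, y1)} per-frame boxes

    Returns the fraction of points lying inside the proposal
    on their frame. Note: a hedged sketch of the idea, not the
    paper's actual measure.
    """
    hits, total = 0, 0
    for t, (x, y) in points.items():
        total += 1
        if t not in proposal:  # point on a frame the tube does not cover
            continue
        x0, y0, x1, y1 = proposal[t]
        hits += int(x0 <= x <= x1 and y0 <= y <= y1)
    return hits / total if total else 0.0
```

In a multiple instance learning setup, such a score could weight or select training proposals per video, with the highest-scoring proposals serving as positive instances for the action class.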

Highlights

  • This paper aims to recognize and localize actions such as skiing, running, and getting out of a vehicle in videos

  • A common limitation is the quality of the spatio-temporal proposals themselves; only a few proposals have a high overlap with the ground truth, making localization a needle-in-a-haystack problem regardless of the model

  • We evaluate the influence of the spatio-temporal proposals upon which our approach is built

Introduction

This paper aims to recognize and localize actions such as skiing, running, and getting out of a vehicle in videos. Action recognition has been a vibrant topic in vision for several decades, resulting in approaches based on local spatio-temporal features (Dollár et al 2005; Laptev 2005; Wang et al 2009), dense trajectories (Jain et al 2013; Wang et al 2013), two-stream neural networks (Simonyan and Zisserman 2014; Feichtenhofer et al 2016), 3D convolutions (Ji et al 2013; Tran et al 2015), and recurrent networks (Donahue et al 2015; Li et al 2018; Srivastava et al 2015). We aim to recognize which actions occur in videos, and to discover when and where the actions are present.
