Abstract

Action tube detection is a challenging task, as it requires not only locating action instances in each frame but also linking them over time. Existing action tube detection methods often employ multi-stage pipelines with complex designs and a time-consuming linking procedure. In this paper, we present a simple end-to-end action tube detection method, termed Sparse Tube Detector (STDet). Unlike dense action detectors, our core idea is to use a set of learnable tube queries and directly decode them into action tubes (i.e., sets of tracked boxes with action labels) from video content. This sparse detection paradigm has several advantages. First, the large number of hand-crafted anchor candidates used in dense action detectors is reduced to a small number of learnable tube queries, which yields a more efficient detection framework. Second, our learnable tube queries attend directly to the whole video content, which gives our method the capacity to capture long-range information for action detection. Finally, our detector performs end-to-end tube detection without a linking procedure: it directly and explicitly predicts the action boundary instead of depending on a linking strategy. Extensive experiments show that STDet outperforms previous state-of-the-art methods on two challenging untrimmed video action detection datasets, UCF101-24 and MultiSports. We hope our method will serve as a simple end-to-end tube detection baseline and inspire new ideas in this direction.
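To make the query-based idea concrete, the sketch below shows a minimal DETR-style tube decoder: a set of learnable queries attends to flattened spatio-temporal video features and each query is decoded into per-frame boxes, an action class, and a temporal boundary. This is only an illustrative assumption of how such a head could look; the module and parameter names (SparseTubeDecoder, num_queries, boundary_head, etc.) are hypothetical and do not come from the STDet paper.

```python
import torch
import torch.nn as nn


class SparseTubeDecoder(nn.Module):
    """Illustrative sketch of a sparse, query-based tube detection head.

    Learnable tube queries cross-attend to video features and are decoded
    into (per-frame boxes, action class, temporal boundary) without any
    post-hoc linking step. Not the authors' implementation.
    """

    def __init__(self, num_queries=40, d_model=256, num_classes=24,
                 num_frames=32, num_layers=6):
        super().__init__()
        # A small set of learnable tube queries replaces dense anchors.
        self.tube_queries = nn.Embedding(num_queries, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        # Prediction heads: one box (cx, cy, w, h) per frame, one class
        # (plus a "no action" slot), and a normalized (start, end) boundary.
        self.box_head = nn.Linear(d_model, 4 * num_frames)
        self.cls_head = nn.Linear(d_model, num_classes + 1)
        self.boundary_head = nn.Linear(d_model, 2)
        self.num_frames = num_frames

    def forward(self, video_features):
        # video_features: (B, T*H*W, d_model) flattened spatio-temporal tokens.
        B = video_features.size(0)
        queries = self.tube_queries.weight.unsqueeze(0).expand(B, -1, -1)
        hs = self.decoder(queries, video_features)            # (B, Q, d_model)
        boxes = self.box_head(hs).sigmoid()                   # (B, Q, 4*T)
        boxes = boxes.view(B, hs.size(1), self.num_frames, 4)
        logits = self.cls_head(hs)                            # (B, Q, C+1)
        boundary = self.boundary_head(hs).sigmoid()           # (B, Q, 2)
        return boxes, logits, boundary


# Usage sketch with random features standing in for a video backbone output.
feats = torch.randn(2, 32 * 7 * 7, 256)
boxes, logits, boundary = SparseTubeDecoder()(feats)
```

In a full system, such predictions would typically be matched to ground-truth tubes with a set-based (Hungarian-style) loss, so that each query learns to cover one action instance end to end rather than relying on per-frame detection plus linking.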
