Exploring Sub-Action Granularity for Weakly Supervised Temporal Action Localization

Binglu Wang, Xun Zhang, Yongqiang Zhao

doi:10.1109/tcsvt.2021.3089323

Abstract

Modeling cross-video relationships is an important issue for the weakly supervised temporal action localization task. To this end, traditional methods operate at the action level and rely on complicated strategies to prepare triplet samples, which mine cross-video relationships only among three videos from two categories. In this work, we observe that action instances from different categories can exhibit similar motion patterns, i.e. sub-actions, and propose to operate at the sub-action granularity to explore cross-video relationships more thoroughly. However, given only video-level category labels, the sub-actions are undefined and not annotated. To tackle this challenge, we represent video features via a group of sub-actions, i.e. the sub-action family. Specifically, the sub-action family contains multiple feature vectors, where each vector is in charge of representing a specific sub-action. The sub-action family is shared among all videos in the dataset, and all videos contribute to learning it. Consequently, we can not only dispense with the complicated sampling strategy but also mine cross-video relationships from all available videos in the dataset. To learn the feature vectors within the sub-action family, we employ a bottom-up temporal action localization paradigm and introduce an extra top-down branch. The sub-action family is introduced into the top-down branch, where it learns its feature vectors by representing raw video features. Moreover, we propose a consistency loss to guide the learning process and a diversity loss to mine distinct sub-actions. Extensive experiments are carried out on three benchmark datasets, i.e. THUMOS14, ActivityNet v1.2 and ActivityNet v1.3, and the proposed method sets a new high in performance.
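To make the idea of a shared sub-action family more concrete, the following is a minimal NumPy sketch, not the formulation used in the paper. It assumes that each snippet feature is softly assigned to a small set of shared vectors via a softmax over dot products, and that diversity is encouraged by penalizing the cosine similarity between different vectors. The names `represent_with_family`, `diversity_penalty`, and `temperature` are hypothetical and introduced only for illustration; the abstract does not specify the exact form of the losses.

```python
import numpy as np


def represent_with_family(features, family, temperature=1.0):
    """Express snippet features as soft mixtures of shared sub-action vectors.

    features: (T, D) array, one row per video snippet.
    family:   (K, D) array, one row per sub-action vector (shared by all videos).
    Returns the (T, K) assignment weights and the (T, D) reconstruction.
    NOTE: softmax assignment is an assumption, not the paper's stated method.
    """
    logits = features @ family.T / temperature      # (T, K) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)  # rows sum to 1
    reconstructed = weights @ family                 # (T, D)
    return weights, reconstructed


def diversity_penalty(family, eps=1e-8):
    """Mean squared cosine similarity between distinct sub-action vectors.

    The value is 0 when the vectors are mutually orthogonal and 1 when they
    are all parallel. This is an illustrative stand-in for a diversity loss.
    """
    norms = np.linalg.norm(family, axis=1, keepdims=True)
    normed = family / (norms + eps)
    similarity = normed @ normed.T                   # (K, K) cosine matrix
    k = family.shape[0]
    off_diagonal = similarity[~np.eye(k, dtype=bool)]  # drop self-similarity
    return float(np.mean(off_diagonal ** 2))
```

Because the same `family` array would be used for every video, gradients from all videos would update it during training, which is one way to read the claim that the family is shared across the dataset without any triplet sampling.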

Full Text