Abstract

Unsupervised temporal action localization in untrimmed videos remains a challenging open problem. Existing works follow a “clustering + localization” framework. However, this framework relies heavily on the features used for clustering and localization; for example, features that carry background information degrade localization performance. To address this problem, we propose a novel Action-positive Separation Learning (APSL) method. APSL follows an iterative “feature separation + clustering + localization” procedure. First, we introduce a feature separation learning (FSL) module. FSL employs separation learning to identify action and background features in a video, and then uses contrastive learning to refine the identified features and remove potential action-negative and background-negative (hard-to-locate) features, yielding action-positive (easy-to-locate) features. Next, in the “clustering” step, we cluster the separated action-positive features to obtain action pseudo-labels. In the “localization” step, given the action pseudo-labels and action-positive features, we employ a temporal action localization module to locate action instance regions, which in turn improves the performance of clustering and FSL. The three steps are learned iteratively and reinforce each other during training. Comprehensive evaluations on the THUMOS'14 and ActivityNet v1.2 datasets demonstrate that our method outperforms cutting-edge weakly supervised and unsupervised methods, achieving state-of-the-art performance.
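
To make the three-step procedure concrete, the following is a minimal toy sketch of one round of “feature separation + clustering + localization”. It is not the APSL implementation: the threshold-based separation stands in for the learned FSL module, plain k-means stands in for the clustering step, and merging consecutive snippets stands in for the learned localization module. All function names, the `high` threshold, and the toy data are assumptions made for illustration only.

```python
import numpy as np

def separate_action_positive(scores, high=0.7):
    """Hypothetical stand-in for FSL: keep only snippets whose
    actionness score is confidently high (action-positive)."""
    return np.asarray(scores, dtype=float) >= high

def kmeans(features, k, iters=20, seed=0):
    """Plain k-means, used here only as a generic clustering step."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(iters):
        # Distance from every feature to every center, shape (n, k).
        dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = features[labels == c].mean(axis=0)
    return labels

def localize(mask, labels):
    """Merge consecutive action-positive snippets that share a
    pseudo-label into (start, end, label) segments (inclusive indices)."""
    full = np.full(len(mask), -1)
    full[mask] = labels
    segments, start = [], None
    for t in range(len(full) + 1):
        cur = full[t] if t < len(full) else -1
        if start is not None and cur != full[start]:
            segments.append((start, t - 1, int(full[start])))
            start = None
        if start is None and cur != -1:
            start = t
    return segments

# Toy video of 8 snippets: assumed actionness scores and 2-D features.
scores = [0.1, 0.9, 0.8, 0.5, 0.2, 0.95, 0.9, 0.1]
features = np.array([[5, 5], [0, 0], [0.1, 0.1], [5, 5],
                     [5, 5], [10, 10], [10.1, 10.1], [5, 5]], dtype=float)

mask = separate_action_positive(scores)          # feature separation
pseudo_labels = kmeans(features[mask], k=2)      # clustering
segments = localize(mask, pseudo_labels)         # localization
print(segments)
```

In the method described above, these steps are learned modules that are repeated and refined jointly during training; the sketch runs a single pass with fixed heuristics purely to show how the outputs of each step feed the next.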
