Abstract

Weakly-supervised temporal action localization (WTAL) aims to localize the action instances and recognize their categories with only video-level labels. Despite great progress, existing methods suffer from severe action-background ambiguity, which mainly arises from background noise and neglect of non-salient action snippets. To address this issue, we propose a generalized evidential deep learning (EDL) framework for WTAL, called Uncertainty-aware Dual-Evidential Learning (UDEL), which extends the traditional paradigm of EDL to adapt to the weakly-supervised multi-label classification goal with the guidance of epistemic and aleatoric uncertainties of which the former comes from models lacking knowledge, while the latter comes from the inherent properties of samples themselves. Specifically, targeting excluding the undesirable background snippets, we fuse the video-level epistemic and aleatoric uncertainties to measure the interference of background noise to video-level prediction. Then, the snippet-level aleatoric uncertainty is further deduced for progressive mutual learning, which gradually focuses on the entire action instances in an "easy-to-hard" manner and encourages the snippet-level epistemic uncertainty to be complementary with the foreground attention scores. Extensive experiments show that UDEL achieves state-of-the-art performance on four public benchmarks. Our code is available in github/mengyuanchen2021/UDEL.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call