Abstract

Most temporal action localization methods are trained on video datasets with frame-wise annotations, which are expensive and time-consuming to acquire. To alleviate this problem, many weakly supervised temporal action localization methods, which leverage only video-level annotations during training, have been proposed. In this paper, we first analyze three problems of weakly supervised temporal action localization: feature similarity, action completeness, and weak annotation. Based on these three problems, we propose a novel network, the multi-stage fusion network, which decomposes the problems into three modules: the feature, sub-action, and action modules. Specifically, for feature similarity, a triplet loss is introduced in the feature module to ensure that action instances of the same class have similar feature sequences and to enlarge the margin between action instances of different classes. For action completeness, each stage of the sub-action module discovers different sub-actions, and complete action instances are localized in the action module by fusing multiple sub-actions from the sub-action module. To alleviate weak annotation, the action module localizes multiple action proposals from the multi-stage outputs of the network and selects the proposals with the highest confidence scores as the predicted action instances. Extensive experimental results on the THUMOS'14 and ActivityNet 1.2 datasets demonstrate that our method outperforms state-of-the-art methods, improving the average mean Average Precision (mAP) on THUMOS'14 from 40.9% to 43.3%.
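
As an illustration of the feature-similarity objective, the following PyTorch sketch applies a standard triplet margin loss to clip-level features. This is a minimal sketch, not the authors' implementation: the feature dimension (1024), batch size (8), and margin (1.0) are illustrative assumptions rather than values from the paper.

```python
import torch
import torch.nn as nn

# Illustrative sketch only (not the authors' code): a standard triplet
# margin loss that pulls together features of action instances from the
# same class and pushes apart features from different classes.
triplet_loss = nn.TripletMarginLoss(margin=1.0, p=2)  # margin is an assumption

# Toy clip-level features; 1024-d features and batch size 8 are assumptions.
anchor   = torch.randn(8, 1024, requires_grad=True)  # instances of class c
positive = torch.randn(8, 1024, requires_grad=True)  # other instances of class c
negative = torch.randn(8, 1024, requires_grad=True)  # instances of other classes

loss = triplet_loss(anchor, positive, negative)
loss.backward()  # in training, gradients would update the feature module
```

The margin hyperparameter controls how far apart different-class features must be before the loss vanishes; in practice it would be tuned on a validation split.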

Highlights

  • Temporal action localization [1] is an essential computer vision task

  • We first formalize the problem of weakly supervised temporal action localization

  • We improve the average mean Average Precision (mAP) from 40.9% to 43.3%

Introduction

Temporal action localization [1] is an essential computer vision task. It has many potential application scenarios, such as video recommendation and search, video surveillance, and human skill evaluation. Supervised methods [2]–[18] have achieved significant improvements in the past few years. These approaches operate in a fully supervised setting, which requires ground-truth temporal boundary annotations for each action instance. Annotating frame-wise ground-truth labels for a new dataset is expensive and time-consuming, since untrimmed videos usually have long durations with many frames. Video-level labels, such as the categories of the action instances contained in a video, are much cheaper to obtain.
