Abstract

Weakly supervised temporal action localization is challenging because only video-level annotations are available during training. To address this problem, we propose a two-stage approach that generates high-quality frame-level pseudo labels by fully exploiting multi-resolution information in the temporal domain and the complementary information between the appearance (i.e., RGB) and motion (i.e., optical flow) streams. In the first stage, we propose an Initial Label Generation (ILG) module to generate reliable initial frame-level pseudo labels. Specifically, this module exploits temporal multi-resolution consistency and cross-stream consistency to generate high-quality class activation sequences (CASs), which consist of a number of sequences, each measuring how likely each video frame is to belong to one specific action class. In the second stage, we propose a Progressive Temporal Label Refinement (PTLR) framework to iteratively refine the pseudo labels, in which a set of selected frames with highly confident pseudo labels is used to progressively train two networks and better predict action class scores at each frame. Specifically, in the PTLR framework, two networks called Network-OTS and Network-RTS, which generate CASs for the original temporal scale and the reduced temporal scales, respectively, are used as two streams (i.e., the OTS stream and the RTS stream) to refine the pseudo labels in turn. In this way, multi-resolution information in the temporal domain is exchanged at the pseudo-label level, and each network/stream is improved by exploiting the refined pseudo labels from the other network/stream. Comprehensive experiments on two benchmark datasets, THUMOS14 and ActivityNet v1.3, demonstrate the effectiveness of the proposed method for weakly supervised temporal action localization.
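
To make the idea of "selected frames with highly confident pseudo labels" concrete, the following is a minimal sketch and not the authors' implementation. It assumes a CAS is stored as a T x C list of scores in [0, 1], and that confident frames are picked with two fixed thresholds. The function name `select_confident_frames` and the thresholds `hi` and `lo` are hypothetical; the abstract does not specify the selection rule.

```python
from typing import List, Optional, Set


def select_confident_frames(
    cas: List[List[float]],
    video_labels: Set[int],
    hi: float = 0.8,  # assumed threshold for a confident action frame
    lo: float = 0.2,  # assumed threshold for a confident background frame
) -> List[List[Optional[int]]]:
    """Turn a class activation sequence into frame-level pseudo labels.

    cas[t][c] is the score of class c at frame t.
    Returns a T x C grid with 1 (action), 0 (background),
    or None (frame left out of training for that class).
    """
    pseudo: List[List[Optional[int]]] = []
    for frame_scores in cas:
        row: List[Optional[int]] = []
        for c, score in enumerate(frame_scores):
            if c not in video_labels:
                # Class is absent at the video level, so no frame can belong to it.
                row.append(0)
            elif score >= hi:
                row.append(1)
            elif score <= lo:
                row.append(0)
            else:
                # Ambiguous score: do not use this frame as supervision.
                row.append(None)
        pseudo.append(row)
    return pseudo


# Toy example: 3 frames, 2 classes, only class 0 is annotated for the video.
toy_cas = [[0.9, 0.1], [0.5, 0.95], [0.1, 0.3]]
toy_labels = select_confident_frames(toy_cas, video_labels={0})
```

In a refinement loop of the kind the abstract describes, labels like these from one stream would supervise the other stream, and the selection would be repeated on the newly predicted CASs.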
