Weakly supervised Temporal Action Detection (TAD), which uses only video-level annotations, can reduce the labor cost of annotation. However, current weakly supervised TAD methods do not take full advantage of the short-term consistency between consecutive frames or the long-term continuity within an action, which results in less accurate detection of action boundaries in untrimmed videos. In this paper, the SuperFrame-based Temporal Proposal (SFTP) is proposed. Each superframe represents a series of consecutive frames with high temporal consistency, and its feature is pooled from the features of those frames through an integration function. Temporal proposals are then built from multiple consecutive superframes, and the features of all proposals are generated from a pyramidal feature hierarchy. This hierarchy consists of the designed Structured Outer-Inner Context (SOIC) features, which are formed from superframe features and explicitly characterize the temporal continuity inside a proposal. Furthermore, a novel Scale-Wise Normalization Strategy (SWNS) is proposed to identify proposals, which can effectively detect multiple actions of different durations in one untrimmed video. Extensive experiments are conducted on two public datasets, THUMOS14 and ActivityNet1.2, for performance evaluation. The experimental results demonstrate that the proposed approach detects the boundaries of actions more effectively and obtains competitive mAP (mean average precision) compared with other approaches.
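To make the superframe idea concrete, the following is a minimal sketch of one possible way to group consecutive frames and pool their features. It is not the method of the paper: the cosine-similarity grouping rule, the `threshold` value, mean pooling as the integration function, and the names `cosine` and `build_superframes` are all assumptions chosen for illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_superframes(
    frames: Sequence[Sequence[float]], threshold: float = 0.9
) -> List[Tuple[int, int, List[float]]]:
    """Group consecutive frames into superframes.

    ASSUMPTION: a new superframe starts whenever the cosine similarity
    between adjacent frame features drops below `threshold`.
    ASSUMPTION: the integration function is a plain mean over the frames.

    Returns a list of (start, end_exclusive, pooled_feature) tuples.
    """
    if not frames:
        return []
    superframes = []
    start = 0
    for i in range(1, len(frames) + 1):
        # Close the current superframe at the end of the video or at a
        # point where adjacent frames are no longer consistent.
        if i == len(frames) or cosine(frames[i - 1], frames[i]) < threshold:
            segment = frames[start:i]
            dim = len(segment[0])
            pooled = [sum(f[d] for f in segment) / len(segment) for d in range(dim)]
            superframes.append((start, i, pooled))
            start = i
    return superframes


# Toy example: two frames with one feature pattern, then three with another.
toy_frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
toy_superframes = build_superframes(toy_frames)
```

In this toy input the similarity between the second and third frames falls below the threshold, so the sketch yields two superframes covering frames 0 to 1 and frames 2 to 4. Under the same assumptions, temporal proposals could then be enumerated as runs of consecutive superframes.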