Abstract

Weakly-supervised temporal action localization aims to identify and localize action instances in untrimmed videos using only video-level action labels. Because frame-level annotations are unavailable, correctly distinguishing foreground from background snippets in a video is crucial for temporal action localization. However, beyond clear foreground and background snippets, videos contain a large number of semantically similar snippets. Such snippets share semantic information with either the foreground or the background, leading to coarse boundary localization of action instances. Inspired by the success of multimodal learning, we extract high-quality semantic features from multimodal inputs and construct a contrast loss to strengthen the model's ability to distinguish semantically similar snippets. In this paper, we propose a fusion detection network with discriminative enhancement (De-FDN). Specifically, we design a fusion detection model (FDM) that fully exploits the complementarity and correlation among multimodal features to extract high-quality semantic features from videos, and then construct multimodal class activation sequences to accurately identify and localize action instances. In addition, we design a discriminative enhancement mechanism (DEM) that enlarges the gap between semantically similar snippets by computing a semantic contrast loss. Extensive experiments on the THUMOS14, ActivityNet1.2, and ActivityNet1.3 datasets demonstrate the effectiveness of our method.
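To make the discriminative enhancement idea concrete, the sketch below shows one way a semantic contrast loss of this kind could be implemented in PyTorch: snippet features are pulled toward snippets with consistent semantics and pushed away from semantically similar snippets of the opposing (foreground/background) group. The function name, tensor shapes, temperature value, and the InfoNCE-style formulation are illustrative assumptions; the abstract does not specify the paper's exact loss.

```python
import torch
import torch.nn.functional as F


def semantic_contrast_loss(anchor, positives, negatives, temperature=0.1):
    """Illustrative contrast loss over snippet features (an assumption, not
    the paper's exact formulation).

    anchor:    (N, D) features of snippets to be discriminated
    positives: (P, D) features of snippets sharing the anchor's semantics
    negatives: (Q, D) features of semantically similar snippets from the
               opposing foreground/background group
    """
    # Normalize so inner products become cosine similarities.
    anchor = F.normalize(anchor, dim=-1)
    positives = F.normalize(positives, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    # Temperature-scaled similarities between anchors and both groups.
    pos_sim = anchor @ positives.t() / temperature   # (N, P)
    neg_sim = anchor @ negatives.t() / temperature   # (N, Q)

    # InfoNCE with multiple positives: maximize the share of similarity
    # mass assigned to positive snippets, which enlarges the gap between
    # the two semantically similar groups.
    pos = torch.logsumexp(pos_sim, dim=1)
    total = torch.logsumexp(torch.cat([pos_sim, neg_sim], dim=1), dim=1)
    return (total - pos).mean()
```

In a training loop, such a term would typically be added to the video-level classification loss with a weighting hyperparameter, so that feature separation and action recognition are optimized jointly.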
