Abstract

Temporal action localization aims to detect the temporal intervals of human actions in untrimmed videos. Most previous methods rely on locating and matching the start and end times of actions. However, action boundaries are inherently ambiguous and uncertain, which leads to inaccurate localization and many false positives. In this paper, we introduce a new framework for temporal action localization that explicitly models temporal action centers to reduce the unreliable detections caused by ambiguous action boundaries. Since action centers are closely tied to the semantics of an action, they can be detected more reliably than conventional action boundaries. As a result, our framework can exclude false positives and promote high-quality proposals. Based on action centers, we propose a triplet feature fusion mechanism. It performs neural message passing among the boundaries, the center, and the contextual regions outside the proposal to enrich the proposal's representation. In addition, we introduce a centerness scoring method to suppress proposals that deviate from the centers of action instances. Consequently, our network can retrieve high-quality action proposals and locate actions more precisely. Experimental results show that our method outperforms state-of-the-art methods on the THUMOS14 and ActivityNet v1.3 datasets.
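
To make the idea of centerness scoring concrete, the following is a minimal sketch and not the formulation used in the paper, which the abstract does not specify. It assumes an FCOS-style centerness, adapted to one temporal dimension, that equals 1 when a detected action center sits at the midpoint of a proposal and decays to 0 as the center approaches either boundary. The function names `temporal_centerness` and `rescore`, and the choice to multiply the proposal confidence by the centerness, are hypothetical and chosen only for illustration.

```python
import math


def temporal_centerness(t: float, start: float, end: float) -> float:
    """Assumed FCOS-style centerness of time t inside the interval [start, end].

    Returns 1.0 at the midpoint, 0.0 at or outside the boundaries.
    This is an illustrative form, not the paper's definition.
    """
    if not start < end:
        raise ValueError("start must be strictly less than end")
    left = t - start   # distance from the start boundary
    right = end - t    # distance to the end boundary
    if left <= 0 or right <= 0:
        return 0.0
    return math.sqrt(min(left, right) / max(left, right))


def rescore(proposals, center):
    """Down-weight proposals whose extent is off-center relative to `center`.

    `proposals` is a list of (start, end, confidence) tuples; `center` is a
    detected action center in the same time units (hypothetical inputs).
    """
    return [
        (s, e, conf * temporal_centerness(center, s, e))
        for s, e, conf in proposals
    ]


if __name__ == "__main__":
    # Two proposals covering the same detected center at t = 5 seconds.
    proposals = [(0.0, 10.0, 0.8), (4.0, 20.0, 0.9)]
    for s, e, score in rescore(proposals, center=5.0):
        print(f"[{s:.1f}, {e:.1f}] -> {score:.3f}")
```

Under this assumed scoring, the second proposal starts with the higher raw confidence but is ranked below the first after rescoring, because the detected center lies close to its start boundary.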
