Abstract

Video-based human action recognition remains a challenging task. Existing approaches suffer from three main limitations: (1) most works are restricted to modeling a single temporal scale; (2) although a few methods consider multilevel motion features, they disregard the fact that different features usually contribute differently; and (3) most attention mechanisms attend only to important regions in frames without considering the spatial structure information around them. To address these issues, a discriminative multi-focused and complementary temporal/spatial attention framework is presented, which consists of a multi-focused temporal attention network with multi-granularity loss (M2TEAN) and a complementary spatial attention network with co-classification loss (C2SPAN). First, M2TEAN not only focuses on discriminative multilevel motion features but also highlights the more discriminative features among them. Specifically, a short-term discriminative attention sub-network and a middle-term consistent attention sub-network are constructed to focus on discriminative short-term and middle-term features, respectively, while a long-term evolutive attention sub-network is proposed to capture the long-term evolution of an action over time. A multi-focused temporal attention module then further highlights the more discriminative features. Second, C2SPAN captures discriminative regions in frames while mining the spatial structure information around them. Experiments show that our methods achieve state-of-the-art results.
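
As an illustration only, the sketch below shows one way a multi-focused temporal attention module could weight short-, middle-, and long-term features so that the more discriminative temporal scales contribute more to the fused representation. The class name MultiFocusedTemporalAttention, the single scoring head, and all tensor shapes are assumptions made for this sketch and are not taken from the paper.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    class MultiFocusedTemporalAttention(nn.Module):
        """Hypothetical sketch: fuse short-, middle-, and long-term features
        with learned attention weights so that more discriminative temporal
        scales contribute more to the final representation."""

        def __init__(self, feat_dim: int):
            super().__init__()
            # A shared scoring head applied to each temporal-scale feature.
            self.score = nn.Linear(feat_dim, 1)

        def forward(self, short_feat, middle_feat, long_feat):
            # Stack the three scale-specific features: (batch, 3, feat_dim).
            feats = torch.stack([short_feat, middle_feat, long_feat], dim=1)
            # Score each scale and normalize across scales with softmax.
            weights = F.softmax(self.score(feats), dim=1)   # (batch, 3, 1)
            # Weighted sum emphasizes the more discriminative scales.
            return (weights * feats).sum(dim=1)             # (batch, feat_dim)


    if __name__ == "__main__":
        batch, dim = 4, 256
        attn = MultiFocusedTemporalAttention(dim)
        s, m, l = (torch.randn(batch, dim) for _ in range(3))
        fused = attn(s, m, l)
        print(fused.shape)  # torch.Size([4, 256])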
