Abstract

This paper explores distinctive spatio-temporal representations learned in a self-supervised manner for group activity recognition. First, previous networks treat spatial- and temporal-aware information as a whole, which limits their ability to represent the complex spatio-temporal correlations in group activities. We propose the Spatial and Temporal Attention Heads (STAHs) to extract spatial- and temporal-aware representations independently, generating complementary contexts that improve group activity understanding. We then propose the Global Spatio-Temporal Contrastive (GSTCo) loss to aggregate these two kinds of features. Unlike previous works that focus on individual temporal consistency while overlooking correlations between actors, i.e., a local perspective, we model global spatial and temporal dependencies. Moreover, GSTCo effectively avoids the trivial solution commonly encountered in contrastive learning by balancing spatial and temporal representations. Furthermore, our method introduces only modest overhead during pre-training and adds no extra parameters or computational cost at inference, ensuring efficiency. Evaluated on widely used group activity recognition datasets, our method performs well, and applying our pre-trained backbone to existing networks achieves state-of-the-art results. Extensive experiments verify the generalizability of our method.
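The abstract does not specify the exact form of the GSTCo objective. As a rough illustration only, the sketch below shows one common way a clip's global spatial embedding could be contrasted against its global temporal embedding with an InfoNCE-style loss; the function name, temperature, and embedding shapes are hypothetical and not taken from the paper.

```python
import torch
import torch.nn.functional as F


def gstco_loss_sketch(spatial_emb: torch.Tensor,
                      temporal_emb: torch.Tensor,
                      temperature: float = 0.1) -> torch.Tensor:
    """Illustrative InfoNCE-style contrastive loss pairing each clip's
    global spatial embedding with its global temporal embedding.
    This is a hypothetical sketch, not the paper's exact GSTCo loss.

    spatial_emb, temporal_emb: (batch, dim) global features, e.g. from
    separate spatial and temporal attention heads.
    """
    # L2-normalize so that dot products are cosine similarities.
    s = F.normalize(spatial_emb, dim=-1)
    t = F.normalize(temporal_emb, dim=-1)

    # Similarity matrix: entry (i, j) compares clip i's spatial view
    # with clip j's temporal view.
    logits = s @ t.T / temperature

    # Positives lie on the diagonal (same clip, two views); other clips
    # in the batch serve as negatives, which discourages the trivial
    # solution of collapsing all embeddings to a single point.
    targets = torch.arange(s.size(0), device=s.device)

    # Symmetrize over spatial-to-temporal and temporal-to-spatial.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))


# Example usage with random embeddings (batch of 8 clips, 128-d features).
loss = gstco_loss_sketch(torch.randn(8, 128), torch.randn(8, 128))
```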
