Abstract

Group activity recognition (GAR) is an increasingly popular topic in computer vision, and many methods have been proposed that achieve strong recognition performance. However, these methods invariably require fine-grained per-person feature extraction and a large network architecture to aggregate individual features or reason about relationships between persons. To reduce the need for extensive annotations and high training costs, weak supervision has emerged as a promising approach. Under the weak supervision paradigm, only coarse-grained labels are used during network training. Nevertheless, this paradigm poses two key challenges. First, it is limited in its ability to model temporal relationships among individual persons; second, it tends to focus on less relevant information, leading to suboptimal optimization of the network parameters. Both challenges result in erroneous judgments about temporal information and in inefficient training. To address them within the weak supervision paradigm, we propose a novel Temporal Contrastive and Spatial Enhancement Coarse-Grained Network (TCSE-CGN) for GAR. TCSE-CGN comprises two simple yet effective streams: the Spatial Enhancement Stream and the Temporal Contrastive Stream. Features are extracted from only a few RGB frames, and half of the extracted features are sent to the Spatial Enhancement Stream, where an attention mechanism enhances them so that the network automatically learns more representative information. The remaining features are sent to the Temporal Contrastive Stream, which uses contrastive learning to model temporal relationships among all frame-level RGB features. Specifically, the network is guided to learn the hidden semantic temporal information in inter-frame sequences. Network parameters are optimized using a combination of a standard cross-entropy loss and a novel temporal contrastive loss.
Comprehensive experiments on two widely used datasets, the Volleyball dataset and the Collective dataset, demonstrate the effectiveness of TCSE-CGN. The results show that TCSE-CGN performs competitively with methods that require more supervision and a larger architecture.
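
The two-stream idea described above can be illustrated with a minimal NumPy sketch. The abstract does not give the exact formulation, so everything here is an assumption made for illustration: the channel-wise split, the sigmoid gate standing in for the attention mechanism, the InfoNCE-style loss that treats adjacent frames as positives, and the weight `lam`. All function names are hypothetical and are not taken from the paper.

```python
import numpy as np

def split_features(feats):
    """Split (T, C) frame-level features into two halves along the channel axis."""
    half = feats.shape[1] // 2
    return feats[:, :half], feats[:, half:]

def spatial_enhance(x):
    """Assumed stand-in for attention: scale each channel by a sigmoid gate
    computed from that channel's mean over the T frames."""
    gate = 1.0 / (1.0 + np.exp(-x.mean(axis=0)))
    return x * gate

def temporal_contrastive_loss(z, temperature=0.1):
    """Assumed InfoNCE-style loss: frame t+1 is the positive for frame t,
    and every other frame in the clip acts as a negative."""
    z = z / (np.linalg.norm(z, axis=1, keepdims=True) + 1e-12)
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)  # a frame is never its own positive or negative
    m = sim.max(axis=1, keepdims=True)
    log_prob = sim - m - np.log(np.exp(sim - m).sum(axis=1, keepdims=True))
    idx = np.arange(z.shape[0] - 1)
    return -log_prob[idx, idx + 1].mean()

def cross_entropy(logits, label):
    """Cross-entropy of a single logit vector against an integer class label."""
    m = logits.max()
    log_prob = logits - m - np.log(np.exp(logits - m).sum())
    return -log_prob[label]

def total_loss(logits, label, temporal_feats, lam=1.0):
    """Combined objective; lam is a hypothetical weighting factor."""
    return cross_entropy(logits, label) + lam * temporal_contrastive_loss(temporal_feats)

# Usage: 6 frames with 8-dimensional features and 8 activity classes (made-up sizes).
rng = np.random.default_rng(0)
feats = rng.normal(size=(6, 8))
spatial_half, temporal_half = split_features(feats)
enhanced = spatial_enhance(spatial_half)
loss = total_loss(rng.normal(size=8), 3, temporal_half, lam=0.5)
```

In a real network the two halves would come from a learned backbone and the gate would have trainable parameters; the sketch only shows how a coarse clip-level label and an inter-frame contrastive term could be combined into one objective.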
