Abstract

Weakly supervised group activity recognition (WSGAR) aims to identify the overall behavior of multiple persons without any fine-grained supervision, such as individual positions or action labels. Traditional methods usually adopt a person-to-whole approach: detect persons with off-the-shelf detectors, extract person-level features, and aggregate them into group-level features to train the classifier. However, these methods are inflexible because they rely heavily on detector quality. To remove the detector, recent works learn several prototype tokens directly from noisy grid features with learnable weights. This treats all local visual information equally and introduces some redundant and ambiguous information. To this end, we propose a novel coarse-fine nested network (CFNN) that coarsely localizes the key visual patches of an activity and then finely learns both the local and the global features. Specifically, we design a nested interactor (NI) to progressively model the spatiotemporal interactions of a learnable global token. Guided by the spatial-interaction cues in NI, we localize several key visual patches via a new coarse-grained spatial localizer (CSL). We then encode these localized visual patches, with the help of the global spatiotemporal dependency, via a new fine-grained spatiotemporal selector (FSS). Extensive experiments on the Volleyball and NBA datasets demonstrate the effectiveness of the proposed CFNN compared with existing competitive methods. Code is available at: https://github.com/gexiaojingshelby/CFNN.
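
To make the coarse localization idea concrete, the following is a minimal sketch, not the CFNN implementation. It assumes that a global token scores each grid patch by scaled dot-product attention and that the top-k patches are kept as "key" patches. The function names, the scoring rule, and the toy values are all hypothetical; the abstract does not specify how CSL selects patches.

```python
import math

def attention_weights(global_token, patches):
    """Softmax over scaled dot products between one global token and each patch."""
    scale = math.sqrt(len(global_token))
    scores = [sum(g * p for g, p in zip(global_token, patch)) / scale
              for patch in patches]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def select_key_patches(global_token, patches, k):
    """Return the indices of the k highest-weighted patches (in spatial order)
    together with the full list of attention weights."""
    weights = attention_weights(global_token, patches)
    ranked = sorted(range(len(patches)), key=lambda i: weights[i], reverse=True)
    return sorted(ranked[:k]), weights

# Toy example (assumed values): one 2-D global token and four 2-D patch features.
global_token = [1.0, 0.0]
patches = [[0.1, 0.9], [2.0, 0.0], [0.0, 0.0], [1.5, 0.5]]
key_indices, weights = select_key_patches(global_token, patches, k=2)
print(key_indices)
```

In this toy setup the two patches most aligned with the global token are kept and the rest are discarded, which is the general sense in which a coarse selection step can reduce redundant grid features before finer encoding.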
