Abstract

Group activity recognition aims to understand the overall activity of multiple interacting people as a whole, across complex spatial and temporal variations in untrimmed video. Despite its remarkable significance in applications, existing methods are often impractical for two reasons: (1) they rely on heavy annotations and off-the-shelf object detectors even at test time; (2) they are trained to predict a closed set of predetermined group categories, which limits their generalization to wider unseen categories. Motivated by this, we propose a novel zero-shot, weakly supervised group activity recognition model, ZSTGroupCLIP, which is not only free from the restriction of heavy annotations but also captures the generalizable elements that are vital for open-set exploration. Specifically, our model builds on visual-textual joint representation learning and seeks discriminative visual cues to align the group vision and category text branches via multi-level neural prompts. Moreover, to further learn group-related contextual representations under weak supervision, we learn layer-wise prompts across the early layers of the group encoder for progressive context modeling. Through extensive experimental validation and comparison with several baseline methods on two benchmarks, the Volleyball and NBA datasets, our method achieves outstanding performance.
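
To make the vision-text alignment idea concrete, the sketch below shows a minimal CLIP-style setup with learnable prompt vectors shared across category names, scored against projected group-level video features by cosine similarity. All class and module names, dimensions, and the mean-pooled prompt fusion here are illustrative assumptions, not the actual ZSTGroupCLIP architecture described in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptedTextEncoder(nn.Module):
    """Hypothetical text branch: learnable context prompts are prepended to
    per-category token embeddings, then mean-pooled into one embedding per class."""
    def __init__(self, num_classes, embed_dim=512, prompt_len=4):
        super().__init__()
        # learnable context prompts shared across all categories
        self.prompts = nn.Parameter(torch.randn(prompt_len, embed_dim) * 0.02)
        # stand-in for (frozen) class-name token embeddings
        self.class_tokens = nn.Parameter(torch.randn(num_classes, embed_dim) * 0.02)

    def forward(self):
        # concatenate prompts with each class token and pool into one vector per class
        prompts = self.prompts.unsqueeze(0).expand(self.class_tokens.size(0), -1, -1)
        tokens = torch.cat([prompts, self.class_tokens.unsqueeze(1)], dim=1)
        return F.normalize(tokens.mean(dim=1), dim=-1)  # (num_classes, embed_dim)

class GroupCLIPSketch(nn.Module):
    """Aligns pooled group-level video features with category text embeddings
    via CLIP-style cosine similarity scaled by a learnable temperature."""
    def __init__(self, num_classes, feat_dim=2048, embed_dim=512):
        super().__init__()
        self.visual_proj = nn.Linear(feat_dim, embed_dim)   # group vision branch head
        self.text_encoder = PromptedTextEncoder(num_classes, embed_dim)
        self.logit_scale = nn.Parameter(torch.tensor(4.6))  # ~log(100), as in CLIP

    def forward(self, group_feats):
        # group_feats: (batch, feat_dim) pooled group representation of a clip
        v = F.normalize(self.visual_proj(group_feats), dim=-1)
        t = self.text_encoder()
        return self.logit_scale.exp() * v @ t.t()           # (batch, num_classes) logits

# usage: similarity scores over 8 hypothetical group-activity categories
model = GroupCLIPSketch(num_classes=8)
logits = model(torch.randn(2, 2048))
print(logits.shape)  # torch.Size([2, 8])
```

Because the category embeddings are produced from text rather than a fixed classifier head, unseen activity names can in principle be scored at test time by swapping in new class tokens, which is the property the zero-shot setting relies on.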
