Abstract

In this paper, we investigate the problem of group activity recognition by learning semantics-preserving attention and contextual interaction among different people. Conventional methods usually aggregate the features extracted from individual persons by pooling operations, which lack physical meaning and cannot fully exploit the contextual information for group activity recognition. To address this, we develop a Semantics-Preserving Teacher-Student (SPTS) network architecture. Our SPTS networks first learn a Teacher Network in the semantic domain that classifies the word of the group activity based on the words of the individual actions. We then design a Student Network in the appearance domain that recognizes the group activity from the input video. During learning, we enforce the Student Network to mimic the Teacher Network. In this way, we allocate semantics-preserving attention to different people, which is more effective at finding the key people and discarding the misleading ones, while requiring no extra labelled data. Moreover, a group of people inherently forms a graph-based structure, where the people and their relationships can be regarded as the nodes and edges of a graph, respectively. Based on this, we build two graph convolutional modules, one on the Teacher Network and one on the Student Network, to reason about the dependencies among different people. Furthermore, we extend our approach to the action segmentation task based on its intermediate features. Experimental results on four datasets for group activity analysis show that our method outperforms state-of-the-art approaches.
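The attention-mimicking idea can be illustrated with a minimal sketch. The abstract does not specify the loss or the pooling formula, so everything below is an assumption made for illustration: the function names are hypothetical, the attention is a plain softmax over per-person scores, and the mean squared error stands in for whatever mimicry objective the paper actually uses.

```python
import math

def softmax(scores):
    """Turn per-person scores into attention weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_mimic_loss(student_scores, teacher_scores):
    """Assumed mimicry term: mean squared error between the student's
    and the teacher's attention distributions over the same people."""
    s = softmax(student_scores)
    t = softmax(teacher_scores)
    return sum((a - b) ** 2 for a, b in zip(s, t)) / len(s)

def attention_pool(features, scores):
    """Attention-weighted sum of per-person feature vectors, used here
    in place of plain max or average pooling."""
    weights = softmax(scores)
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
```

In this sketch, a student whose attention matches the teacher's incurs zero mimicry loss, and people with higher scores contribute more to the pooled group feature.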
