Social relations are ubiquitous and form the basis of social structure in our daily life. However, existing studies mainly focus on recognizing social relations from still images and movie clips, which differ from real-world scenarios. For example, movie-based datasets treat the task as video classification and recognize only one relation per scene. In this article, we study the problem of social relation recognition in an open environment. To close the gap, we provide the first video dataset collected from real-life scenarios, named social relation in the wild (SRIW), in which the number of people can be large and variable, and the relation between each pair of people must be recognized. To overcome the new challenges, we propose a spatio-temporal relation graph convolutional network (STRGCN) architecture, which uses correlated visual features to recognize social relations intuitively. Our method decouples the task into two classification tasks: person-level and pair-level relation recognition. Specifically, we propose a person behavior and character module that explicitly encodes moving and static features. We then take these as node features to build a relation graph with meaningful edges for each scene. Based on the relation graph, we apply a graph convolutional network (GCN) and a local GCN to encode social relation features, which are used for both recognition tasks. Experimental results demonstrate the effectiveness of the proposed framework, which achieves 83.1% and 40.8% mAP in person-level and pair-level classification, respectively. The study also improves the practical applicability of social relation recognition.
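To make the graph-based idea concrete, the following is a minimal sketch of one generic GCN propagation step over a person relation graph, followed by the construction of pair-level features. It is not the authors' STRGCN: the symmetric normalization with self-loops, the concatenation used for pair features, and all function names (`normalize_adjacency`, `gcn_layer`, `pair_features`) are assumptions chosen for illustration, and the abstract does not specify how the local GCN or the edges are defined.

```python
import math
from itertools import combinations


def normalize_adjacency(adj):
    """Return D^-1/2 (A + I) D^-1/2 for a symmetric adjacency matrix.

    Assumption: the common self-loop normalization for GCNs; the abstract
    does not state which normalization STRGCN uses.
    """
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def gcn_layer(adj_norm, feats, weight):
    """One propagation step: ReLU(A_norm @ X @ W)."""
    out = matmul(matmul(adj_norm, feats), weight)
    return [[max(0.0, v) for v in row] for row in out]


def pair_features(node_feats):
    """Concatenate the features of every unordered pair of people.

    Hypothetical stand-in for the input of a pair-level classifier.
    """
    return {(i, j): node_feats[i] + node_feats[j]
            for i, j in combinations(range(len(node_feats)), 2)}


# Toy scene: three people, with edges 0-1 and 1-2 (illustrative values only).
adjacency = [[0, 1, 0],
             [1, 0, 1],
             [0, 1, 0]]
person_feats = [[1.0, 0.0],   # one row of node features per person
                [0.0, 1.0],
                [1.0, 1.0]]
weight = [[1.0, 0.0],         # identity weight keeps the toy example readable
          [0.0, 1.0]]

hidden = gcn_layer(normalize_adjacency(adjacency), person_feats, weight)
pairs = pair_features(hidden)  # one feature vector per pair of people
```

In this sketch, the rows of `hidden` would feed a person-level classifier and the values of `pairs` a pair-level classifier, mirroring the two decoupled tasks described above. A real implementation would learn `weight` and would normally use a tensor library rather than nested lists.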