Abstract

As a crucial task for video analysis, recognizing social relations between characters gives intelligent applications great potential to better understand the behaviors and emotions of human beings. Most existing methods focus on training models from a large amount of labeled data. However, labeling social relations in videos is time-consuming. To solve this problem, we propose a Pre-trained Multimodal Feature Learning (PMFL) framework for self-supervised learning from unlabeled video data, and then transfer the pre-trained PMFL to the downstream social relation recognition task. First, the space-time interactions between visual instances and the cross-modal interactions between visual and textual information provide important cues for social relation understanding. To incorporate these cues, we design a Multimodal Instance Interaction Transformer (MIIT), which consists of two Transformers that capture intra-modal and cross-modal information interaction, respectively. Second, to better endow PMFL with the capability of learning visual and textual semantic features, we pre-train it via two tasks: Masked Action Feature Regression (MAFR) and Masked Object Label Classification (MOLC). These tasks help the model learn both intra-modal and cross-modal semantic information. After fine-tuning from the pre-trained parameters, PMFL achieves state-of-the-art results on a public benchmark.
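To make the described design more concrete, the sketch below illustrates one plausible reading of the two-Transformer MIIT structure and the two masked-prediction pre-training heads (MAFR and MOLC). It is a minimal sketch in standard PyTorch; all module choices, dimensions, masking strategy, and loss weighting are assumptions for illustration, not the paper's actual implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MIITSketch(nn.Module):
    """Illustrative two-stage Transformer: intra-modal, then cross-modal."""
    def __init__(self, d_model=512, n_heads=8, n_layers=2, n_object_labels=1000):
        super().__init__()
        # Intra-modal Transformers: self-attention within each modality's tokens.
        self.intra_visual = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
        self.intra_text = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
        # Cross-modal Transformer: joint attention over concatenated visual + text tokens.
        self.cross_modal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
        # MAFR head: regress the (masked) action features of visual instances.
        self.mafr_head = nn.Linear(d_model, d_model)
        # MOLC head: classify the object label of masked visual regions.
        self.molc_head = nn.Linear(d_model, n_object_labels)

    def forward(self, visual_tokens, text_tokens):
        v = self.intra_visual(visual_tokens)            # (B, Nv, d)
        t = self.intra_text(text_tokens)                # (B, Nt, d)
        joint = self.cross_modal(torch.cat([v, t], dim=1))
        v_out = joint[:, : visual_tokens.size(1)]       # visual part of joint output
        return v_out, joint

def pretraining_loss(model, visual_tokens, text_tokens,
                     masked_idx, target_features, target_labels):
    """Hypothetical combined objective: MAFR (regression) + MOLC (classification)."""
    v_out, _ = model(visual_tokens, text_tokens)
    masked = v_out[:, masked_idx]                       # outputs at masked visual positions
    mafr = F.mse_loss(model.mafr_head(masked), target_features)
    molc = F.cross_entropy(model.molc_head(masked).flatten(0, 1),
                           target_labels.flatten())
    return mafr + molc

In this reading, the intra-modal Transformers model space-time interaction among visual instances (and, separately, interaction among textual tokens), while the cross-modal Transformer lets visual and textual tokens attend to each other; the two masked-prediction heads then supervise the joint representation without social-relation labels.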
