Abstract

As a crucial task in video analysis, recognizing social relations between characters gives intelligent applications great potential to better understand human behaviors and emotions. Most existing methods focus on training models from large amounts of labeled data; however, labeling social relations in videos is time-consuming. To solve this problem, we propose a Pre-trained Multimodal Feature Learning (PMFL) framework for self-supervised learning from unlabeled video data, and then transfer the pre-trained PMFL to the downstream social relation recognition task. First, the space-time interaction between visual instances and the cross-modal interaction between visual and textual information provide important cues for social relation understanding. To incorporate these cues, we design a Multimodal Instance Interaction Transformer (MIIT), which consists of two Transformers that capture intra-modal and cross-modal information interaction, respectively. Second, to better endow PMFL with the capability to learn visual and textual semantic features, we pre-train it on two tasks: Masked Action Feature Regression (MAFR) and Masked Object Label Classification (MOLC). These tasks help the model learn both intra-modal and cross-modal semantic information. After fine-tuning from the pre-trained parameters, PMFL achieves state-of-the-art results on a public benchmark.
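To make the described setup concrete, the following is a minimal PyTorch sketch of the components named in the abstract. The class names follow the abstract (MIIT, MAFR, MOLC), but every dimension, layer count, head design, and masking detail is an assumption for illustration, not the authors' implementation.

```python
# Hypothetical sketch of PMFL pre-training as outlined in the abstract.
# All hyperparameters and module layouts are illustrative assumptions.
import torch
import torch.nn as nn


class MIIT(nn.Module):
    """Multimodal Instance Interaction Transformer: one Transformer stage for
    intra-modal interaction per modality, one for cross-modal interaction."""

    def __init__(self, dim=512, heads=8, layers=2):
        super().__init__()
        def enc():
            layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            return nn.TransformerEncoder(layer, layers)
        self.intra_visual = enc()   # space-time interaction among visual instances
        self.intra_text = enc()     # interaction among text tokens
        self.cross_modal = enc()    # interaction across the joint sequence

    def forward(self, vis, txt):
        vis = self.intra_visual(vis)           # (B, Nv, dim)
        txt = self.intra_text(txt)             # (B, Nt, dim)
        joint = torch.cat([vis, txt], dim=1)   # (B, Nv + Nt, dim)
        return self.cross_modal(joint)


class PMFLPretrainer(nn.Module):
    """Heads for the two self-supervised tasks: MAFR regresses masked action
    features; MOLC classifies masked object labels."""

    def __init__(self, dim=512, action_dim=1024, num_obj_classes=1000):
        super().__init__()
        self.backbone = MIIT(dim)
        self.mafr_head = nn.Linear(dim, action_dim)        # feature regression
        self.molc_head = nn.Linear(dim, num_obj_classes)   # label classification

    def forward(self, vis, txt, masked_idx, action_target, object_target):
        # masked_idx, action_target, object_target are placeholder supervision
        # signals; how masks and targets are built is not specified here.
        out = self.backbone(vis, txt)
        masked = out[:, masked_idx]            # hidden states at masked positions
        loss_mafr = nn.functional.mse_loss(self.mafr_head(masked), action_target)
        loss_molc = nn.functional.cross_entropy(
            self.molc_head(masked).flatten(0, 1), object_target.flatten())
        return loss_mafr + loss_molc
```

Under these assumptions, pre-training minimizes the sum of the two masked-prediction losses on unlabeled video, and the MIIT backbone is then fine-tuned with a relation classification head for the downstream task.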
