Abstract

Video contains rich semantic knowledge across multiple modalities related to a person. Mining deep or latent semantic knowledge in video can help artificial intelligence better understand human behavior and emotion. Research on deep, contextual semantic knowledge in video is still scarce: much of the work on mining character knowledge and visual relationships between people remains confined to static images, paying little attention to temporal visual features and other important modalities. To better mine the semantic knowledge in video, we propose a novel Global-Local VLAD (GL-VLAD) module, which uses convolutions at different scales to enlarge the receptive field and extract both global and local information from video features. In addition, we propose a Multimodal Fusion Graph (MFG) that attends to knowledge from different modalities and can represent general features in multimodal video scenes. We conduct extensive experiments on social relation extraction and person recognition on the MovieGraphs and IQIYI-VID-2019 datasets. On IQIYI-VID-2019, accuracy and mAP reach 90.23% and 89.87%, respectively. On the fine-grained MovieGraphs dataset, accuracy reaches 56.13% for the relation extraction task, while person recognition achieves 89.31% accuracy and 85.24% mAP. The experimental results show that our proposed method outperforms state-of-the-art methods.
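To make the GL-VLAD idea concrete, the sketch below illustrates the general pattern the abstract describes: parallel temporal convolutions with different kernel sizes (and hence different receptive fields) followed by a VLAD-style soft-assignment pooling of frame features. This is a minimal illustrative sketch in PyTorch, not the authors' exact implementation; the module names, kernel sizes, cluster count, and feature dimension are assumptions chosen for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleConv(nn.Module):
    """Parallel temporal convolutions with different kernel sizes, mixing
    local (small receptive field) and global (large receptive field) context.
    Hypothetical illustration of the multi-scale idea in GL-VLAD."""
    def __init__(self, dim, kernel_sizes=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(dim, dim, k, padding=k // 2) for k in kernel_sizes
        )

    def forward(self, x):           # x: (batch, time, dim)
        x = x.transpose(1, 2)       # -> (batch, dim, time) for Conv1d
        out = sum(branch(x) for branch in self.branches)
        return out.transpose(1, 2)  # back to (batch, time, dim)

class NetVLAD(nn.Module):
    """Soft-assignment VLAD pooling over the temporal axis."""
    def __init__(self, dim, num_clusters=8):
        super().__init__()
        self.assign = nn.Linear(dim, num_clusters)
        self.centroids = nn.Parameter(torch.randn(num_clusters, dim))

    def forward(self, x):                           # x: (batch, time, dim)
        a = F.softmax(self.assign(x), dim=-1)       # soft assignments (batch, time, K)
        residual = x.unsqueeze(2) - self.centroids  # (batch, time, K, dim)
        vlad = (a.unsqueeze(-1) * residual).sum(1)  # (batch, K, dim)
        vlad = F.normalize(vlad, dim=-1)            # intra-normalization
        return F.normalize(vlad.flatten(1), dim=-1) # clip descriptor (batch, K*dim)

# Example: aggregate a clip of 32 frame-level features of dimension 512.
frames = torch.randn(2, 32, 512)
model = nn.Sequential(MultiScaleConv(512), NetVLAD(512))
clip_descriptor = model(frames)   # shape (2, 8*512)
```

In this reading, the multi-scale branches supply the "global-local" context before aggregation, and the VLAD pooling turns the variable-length frame sequence into a fixed-size clip descriptor that downstream relation-extraction or person-recognition heads could consume.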
