Abstract

Visual relationship detection aims to recognize visual relationships in scenes as triplets 〈subject-predicate-object〉. Previous works have shown remarkable progress by introducing multimodal features, external linguistics, scene context, etc. However, because informative multimodal hyper-relations (i.e., relations among relationships) are discarded, the meaningful context of relationships is not yet fully captured, which limits reasoning ability. In this work, we propose a Multimodal Similarity Guided Relationship Interaction Network (MSGRIN) to explicitly model the relations of relationships within the graph neural network paradigm. Given a visual scene, MSGRIN takes the visual relationships as nodes to construct an adaptive graph and enhances deep message passing by introducing Entity Appearance Reconstruction, Entity Relevance Filtering, and Multimodal Similarity Attention. We conduct extensive experiments on two datasets: Visual Relationship Detection (VRD) and Visual Genome (VG). The evaluation results demonstrate that the proposed MSGRIN performs more effectively overall.
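To make the core idea concrete, the sketch below illustrates one round of similarity-guided message passing over a graph whose nodes are candidate relationships, with an adaptive (soft) adjacency matrix derived from pairwise multimodal similarity. This is only a minimal illustration under assumptions about the feature layout; the module and parameter names are hypothetical and do not reproduce the authors' implementation of Entity Appearance Reconstruction, Entity Relevance Filtering, or Multimodal Similarity Attention.

```python
# Minimal sketch (not the authors' code): message passing among relationship
# nodes, where edge weights come from pairwise multimodal similarity.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimilarityGuidedMessagePassing(nn.Module):
    """One round of message passing over relationship nodes.

    Each node feature is assumed to concatenate visual and semantic
    (word-embedding) cues of a <subject-predicate-object> candidate.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.message = nn.Linear(dim, dim)

    def forward(self, nodes: torch.Tensor) -> torch.Tensor:
        # nodes: (N, dim) multimodal features of N relationship candidates
        q, k = self.query(nodes), self.key(nodes)
        # Pairwise similarity -> adaptive soft adjacency over relationships
        sim = torch.matmul(q, k.t()) / nodes.size(-1) ** 0.5
        adj = F.softmax(sim, dim=-1)
        # Aggregate messages from related relationships and update each node
        return nodes + torch.matmul(adj, self.message(nodes))


if __name__ == "__main__":
    layer = SimilarityGuidedMessagePassing(dim=256)
    rel_nodes = torch.randn(12, 256)   # 12 relationship candidates
    print(layer(rel_nodes).shape)      # torch.Size([12, 256])
```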
