Abstract
Visual relationship detection aims to recognize visual relationships in scenes as triplets 〈subject-predicate-object〉. Previous works have shown remarkable progress by introducing multimodal features, external linguistics, scene context, etc. Due to the loss of informative multimodal hyper-relations (i.e. relations of relationships), the meaningful contexts of relationships are not fully captured yet, which limits the reasoning ability. In this work, we propose a Multimodal Similarity Guided Relationship Interaction Network (MSGRIN) to explicitly model the relations of relationships in graph neural network paradigm. In a visual scene, the MSGRIN takes the visual relationships as nodes to construct an adaptive graph and enhances deep message passing by introducing Entity Appearance Reconstruction, Entity Relevance Filtering and Multimodal Similarity Attention. We have conducted extensive experiments on two datasets: Visual Relationship Detection (VRD) and Visual Genome (VG). The evaluation results demonstrate that the proposed MSGRIN has empirically performed more effectively overall.
Published Version
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have