Abstract

In multimodal relation extraction (MRE), images often carry details that are difficult to express in the accompanying text. Integrating textual and visual information is therefore the mainstream approach to improving the understanding and extraction of relations between entities. However, existing MRE methods neglect the semantic gap caused by the heterogeneity of the two modalities. In addition, some approaches map the relations between target objects in image scene graphs onto the text, but the large number of invalid visual relations introduces noise. To alleviate these problems, we propose a novel multimodal relation extraction method based on cooperative enhancement of dual-channel visual semantic information (CE-DCVSI). Specifically, to mitigate the semantic gap between modalities, we perform fine-grained semantic alignment between entities and target objects through multimodal heterogeneous graphs, using a heterogeneous graph Transformer to align the feature representations of the different modalities in a shared semantic space, which improves the consistency and accuracy of those representations. To reduce the effect of useless visual relations, we perform multi-scale feature fusion between different levels of visual information and the textual representations, which increases the complementarity between features and improves the comprehensiveness and robustness of the multimodal representation. Finally, we apply the information bottleneck principle to filter invalid information out of the multimodal representation and thereby mitigate the negative impact of irrelevant noise. Experiments show that the method achieves an F1 score of 86.08% on the publicly available MRE dataset, outperforming the baseline methods.
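For readers unfamiliar with the information bottleneck principle mentioned above, its general objective can be written as follows. This is the standard textbook form, not necessarily the exact loss used in CE-DCVSI. The symbols are assumptions introduced here for illustration: $X$ is the fused multimodal representation, $Z$ the compressed representation, $Y$ the relation label, and $\beta > 0$ a trade-off weight.

```latex
% General information bottleneck objective (illustrative; symbols are assumed, not taken from the paper)
\min_{p(z \mid x)} \; I(X; Z) \;-\; \beta \, I(Z; Y)
```

Minimizing $I(X; Z)$ compresses the input and discards information in it, while the $-\beta\, I(Z; Y)$ term rewards keeping whatever is predictive of the relation label. In that sense, noise that does not help predict $Y$ is what gets filtered out.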
