Abstract

Multimodal relation extraction is a critical task in information extraction, aiming to predict the class of relations between head and tail entities from linguistic sequences and related images. However, the current works are vulnerable to less relevant visual objects detected from images and are not able to sufficiently fuse visual information into text pre-trained models. To overcome these problems, we propose a Two-Stage Visual Fusion Network (TSVFN) that employs the multimodal fusion approach in vision-enhanced entity relation extraction. In the first stage, we design multimodal graphs, whose novelty lies mainly in transforming the sequence learning into the graph learning. In the second stage, we merge the transformer-based visual representation into the text pre-trained model by a multi-scale cross-model projector. Specifically, two multimodal fusion operations are implemented inside the pre-trained model respectively. We finally accomplish deep interaction of multimodal multi-structured data in two fusion stages. Extensive experiments are conducted on a dataset (MNRE), our model outperforms the current state-of-the-art method by 1.76%, 1.52%, 1.29%, and 1.17% in terms of accuracy, precision, recall, and F1 score, respectively. Moreover, our model also achieves excellent results under the condition of fewer samples.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.