Joint Multimodal Entity and Relation Extraction (JMERE), which must combine complex image information to extract entity-relation quintuples from text sequences, places higher demands on a model's multimodal feature fusion and selection capabilities. With the advancement of large pre-trained language models, existing studies focus on improving feature alignment between the textual and visual modalities. However, there remains a noticeable gap in capturing the temporal information present in textual sequences. In addition, these methods struggle to distinguish irrelevant images when integrating image and text features, making them susceptible to interference from image information unrelated to the text. To address these challenges, we propose a Temporally Enhanced and Similarity-Gated Attention network (TESGA) for joint multimodal entity and relation extraction. Specifically, we first incorporate an LSTM-based Text Temporal Enhancement module to strengthen the model's ability to capture temporal information from the text. Next, we introduce a Text-Image Similarity-Gated Attention mechanism, which controls how much image information is incorporated based on the consistency between image and text features. Finally, we design an entity and relation prediction module that uses a form-filling approach based on entity and relation types to predict entity-relation quintuples. Notably, beyond the JMERE task, our approach can also be applied to other tasks involving text-visual enhancement, such as Multimodal Named Entity Recognition (MNER) and Multimodal Relation Extraction (MRE). To demonstrate its effectiveness, we evaluate our model extensively on three benchmark datasets, where it achieves state-of-the-art performance. Our code will be available upon paper acceptance at https://github.com/vacuum-cup/TESGA.
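The similarity-gated fusion described above can be illustrated with a minimal sketch in which a gate derived from the cosine similarity between pooled text and image features scales the visual contribution, so that images inconsistent with the text contribute less. The module structure, pooling choice, attention configuration, and all names below are illustrative assumptions rather than the paper's implementation.

```python
# Minimal sketch (assumed, not the authors' code) of similarity-gated
# cross-modal fusion: text tokens attend to image regions, and the
# attended visual features are scaled by a gate computed from the
# text-image similarity before being fused back into the text stream.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimilarityGatedFusion(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate_proj = nn.Linear(1, 1)  # maps the similarity score to a gate logit

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # text_feats: (batch, text_len, dim); image_feats: (batch, num_regions, dim)
        # Cross-modal attention: text tokens query the image regions.
        attended, _ = self.attn(text_feats, image_feats, image_feats)

        # Global representations via mean pooling over tokens / regions.
        text_global = text_feats.mean(dim=1)    # (batch, dim)
        image_global = image_feats.mean(dim=1)  # (batch, dim)

        # Gate in [0, 1] driven by text-image cosine similarity (consistency).
        sim = F.cosine_similarity(text_global, image_global, dim=-1)
        gate = torch.sigmoid(self.gate_proj(sim.unsqueeze(-1)))  # (batch, 1)

        # Scale the visual contribution by the gate before adding it to the text.
        return text_feats + gate.unsqueeze(1) * attended


# Example usage with random features of matching dimension.
fusion = SimilarityGatedFusion(dim=768)
text = torch.randn(2, 32, 768)    # e.g., 32 text tokens
image = torch.randn(2, 49, 768)   # e.g., 49 image regions
fused = fusion(text, image)       # (2, 32, 768)
```

Under this reading, an unrelated image yields a low similarity score and hence a small gate, which limits how much visual information leaks into the fused text representation.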