Abstract

Multimodal Named Entity Recognition (MNER) aims to identify entities of predefined types in text by leveraging information from multiple modalities, most notably text and images. Most efforts concentrate on improving cross-modality attention mechanisms to facilitate guidance between modalities. However, they still suffer from certain limitations: (1) it is difficult to establish a unified representation to bridge the semantic gap among different modalities; (2) mining the implicit relationships between text and image is crucial yet challenging. In this paper, we propose an Instruction Construction and Knowledge Alignment Framework for MNER, named ICKA, to address these issues. Specifically, we first employ a multi-head cross-modal attention mechanism to obtain a cross-modal fusion representation by fusing features from text–image pairs. Then, we integrate external knowledge from a pre-trained vision-language model (VLM) to facilitate semantic alignment between text and image and to derive inter-modality connections. Next, we construct a multimodal instruction that combines the per-modality features, using the inter-modality connections as a bridge between them. We then integrate the instruction into the language model to effectively incorporate multimodal knowledge. Finally, we perform sequence labeling using a Conditional Random Field (CRF) decoder with a gating mechanism. The proposed method achieves F1 scores of 75.42% on the Twitter2015 dataset and 87.12% on the Twitter2017 dataset, demonstrating its competitiveness.
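As a rough illustration of the fusion step described in the abstract, the following PyTorch sketch shows text token features attending over image-region features via multi-head cross-modal attention, with a learned gate controlling how much visual context enters the fused representation. This is a hypothetical reconstruction, not the authors' implementation: the hidden size, head count, class names, and the exact gating formulation are all assumptions.

import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Illustrative gated cross-modal fusion (assumed design, not the paper's code)."""

    def __init__(self, d_model: int = 768, n_heads: int = 8):
        super().__init__()
        # Multi-head cross-modal attention: queries come from text,
        # keys/values from image regions.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Per-token gate deciding how much visual context to admit.
        self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())

    def forward(self, text_feats: torch.Tensor, img_feats: torch.Tensor) -> torch.Tensor:
        # text_feats: (batch, seq_len, d_model), e.g. encoder token embeddings
        # img_feats:  (batch, n_regions, d_model), e.g. projected patch features
        visual_ctx, _ = self.cross_attn(text_feats, img_feats, img_feats)
        g = self.gate(torch.cat([text_feats, visual_ctx], dim=-1))
        # Residual connection keeps the text representation intact when the
        # gate closes, so text-only evidence can still drive the labeling.
        return text_feats + g * visual_ctx

# Toy usage with illustrative shapes:
fusion = CrossModalFusion()
text = torch.randn(2, 32, 768)   # 2 sentences, 32 tokens each
image = torch.randn(2, 49, 768)  # 2 images, 7x7 patch grid
fused = fusion(text, image)      # (2, 32, 768), fed onward to the CRF decoder

The residual-plus-gate pattern is a common choice in MNER-style fusion because social-media images are often uninformative or misleading; a near-zero gate lets the model fall back to a purely textual representation.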
