Although existing Multimodal Named Entity Recognition (MNER) methods have achieved promising performance, they suffer from two drawbacks in social media scenarios. First, most existing methods rest on the strong assumption that the textual content and the associated images are matched, which does not always hold in real scenarios. Second, current methods fail to filter out modality-specific random noise, which prevents models from exploiting modality-shared features. In this paper, we propose a novel multi-task multimodal learning architecture that aims to improve MNER performance through cross-modal auxiliary tasks (CMAT). Specifically, we first separate the shared and task-specific features for the main task and the auxiliary tasks, respectively, via a cross-modal gate-control mechanism. Then, without extra pre-processing or annotations, we employ cross-modal matching to address the issue of mismatched image-text pairs, and cross-modal mutual information maximization to retain the most relevant cross-modal features. Experimental results on two widely used datasets confirm the superiority of the proposed approach.
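The abstract does not give implementation details, but the three named components lend themselves to a compact illustration. Below is a minimal sketch, assuming PyTorch, of (a) a cross-modal gate-control fusion, (b) a binary cross-modal matching head, and (c) an InfoNCE-style objective often used to maximize mutual information between modalities. All module names, dimensions, and design choices here are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch, not the authors' code. Assumes text features h_t and
# image features h_v of equal dimension d, already produced by unimodal
# encoders (e.g., a BERT-style text encoder and a CNN/ViT image encoder).
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalGate(nn.Module):
    """Gate-control fusion: a sigmoid gate decides, per dimension, how much
    visual information to mix into the textual representation."""

    def __init__(self, d: int):
        super().__init__()
        self.gate = nn.Linear(2 * d, d)

    def forward(self, h_t: torch.Tensor, h_v: torch.Tensor) -> torch.Tensor:
        # h_t, h_v: (batch, d)
        g = torch.sigmoid(self.gate(torch.cat([h_t, h_v], dim=-1)))
        return g * h_t + (1.0 - g) * h_v  # modality-shared representation


class MatchingHead(nn.Module):
    """Cross-modal matching auxiliary task: predict whether a given
    image-text pair is actually matched (binary classification)."""

    def __init__(self, d: int):
        super().__init__()
        self.clf = nn.Linear(2 * d, 2)

    def forward(self, h_t: torch.Tensor, h_v: torch.Tensor) -> torch.Tensor:
        return self.clf(torch.cat([h_t, h_v], dim=-1))  # (batch, 2) logits


def info_nce(z_t: torch.Tensor, z_v: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """InfoNCE lower bound on the mutual information between text and image
    features: matched pairs in a batch are positives, all others negatives."""
    z_t = F.normalize(z_t, dim=-1)
    z_v = F.normalize(z_v, dim=-1)
    logits = z_t @ z_v.t() / tau                         # (batch, batch)
    labels = torch.arange(z_t.size(0), device=z_t.device)  # i-th text <-> i-th image
    return F.cross_entropy(logits, labels)
```

In a multi-task setup of this kind, the total training loss would typically combine the main NER objective with the matching and mutual-information auxiliary losses under tunable weights; the specific weighting scheme used by the paper is not stated in the abstract.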