Abstract
Although existing Multimodal Named Entity Recognition (MNER) methods have achieved promising performance, they suffer from two drawbacks in social media scenarios. First, most existing methods rest on the strong assumption that the textual content and the associated images are matched, which does not always hold in real scenarios. Second, current methods fail to filter out modality-specific random noise, which prevents models from exploiting modality-shared features. In this paper, we propose a novel multi-task multimodal learning architecture that improves MNER performance through cross-modal auxiliary tasks (CMAT). Specifically, we first separate the shared and task-specific features for the main task and the auxiliary tasks via a cross-modal gate-control mechanism. Then, without extra pre-processing or annotations, we use cross-modal matching to address mismatched image-text pairs and cross-modal mutual information maximization to retain the most relevant cross-modal features. Experimental results on two widely used datasets confirm the superiority of the proposed approach.
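The abstract does not give implementation details, so the following PyTorch sketch only illustrates two of the named components under stated assumptions: a cross-modal gate that mixes text and image features into shared and task-specific views, and an InfoNCE objective, a standard lower bound used for mutual information maximization. All module names, dimensions, and the pooling step are illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalGate(nn.Module):
    """One plausible form of a cross-modal gate-control mechanism.

    A learned, per-token gate decides how much image-conditioned
    information to mix into the textual representation; the residual
    serves as a modality-specific view. This is a sketch, not the
    paper's architecture.
    """

    def __init__(self, dim: int = 768):
        super().__init__()
        # Scores the concatenated modalities to produce gate values in (0, 1).
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor):
        # text_feats, image_feats: (batch, seq_len, dim); image features are
        # assumed to be already aligned (e.g. attended) to each text token.
        g = torch.sigmoid(self.gate(torch.cat([text_feats, image_feats], dim=-1)))
        shared = g * image_feats + (1 - g) * text_feats  # modality-shared view
        specific = text_feats - shared                   # task-specific residual
        return shared, specific


def infonce_loss(text_vec: torch.Tensor, image_vec: torch.Tensor,
                 temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE: matched image-text pairs sit on the diagonal of the
    in-batch similarity matrix; minimizing this loss maximizes a lower
    bound on cross-modal mutual information."""
    text_vec = F.normalize(text_vec, dim=-1)
    image_vec = F.normalize(image_vec, dim=-1)
    logits = text_vec @ image_vec.t() / temperature  # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)


# Usage with random tensors standing in for encoder outputs.
text = torch.randn(2, 16, 768)
image = torch.randn(2, 16, 768)
shared, specific = CrossModalGate()(text, image)
loss = infonce_loss(shared.mean(dim=1), image.mean(dim=1))  # mean-pooling is an assumption
print(shared.shape, specific.shape, loss.item())
```

In this sketch the gate plays the noise-filtering role described in the abstract, while InfoNCE stands in for the mutual-information objective; the cross-modal matching task would be layered on as an additional binary matched/mismatched classification loss.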