Abstract

As social media posts become increasingly multimodal, Multimodal Named Entity Recognition (MNER), which identifies entities in text with the help of its accompanying image, is attracting growing attention because it supports applications such as intention understanding and user recommendation. However, existing approaches have two drawbacks: (1) The meanings of the text and its accompanying image do not always match, so the text still plays the dominant role; yet social media posts are typically shorter and more informal than standard text, which easily leads to incomplete semantic descriptions and data sparsity. (2) Although visual representations are used, existing methods ignore either the fine-grained semantic correspondence between objects in the image and words in the text, or the fact that some images contain misleading objects or no objects at all. In this work, we address both problems by introducing multi-granularity cross-modal representation learning. For the first problem, we enhance the representation of each word in the text through semantic augmentation. For the second, we perform cross-modal semantic interaction between text and vision at different visual granularities to obtain the most effective multimodal guidance representation for every word. Experiments show that our results on TWITTER-2015 (74.57%) and TWITTER-2017 (86.09%) outperform current state-of-the-art performance. The code, data, and best-performing models are available at https://github.com/LiuPeiP-CS/IIE4MNER.
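To make the idea of multi-granularity cross-modal interaction more concrete, the sketch below shows one way word representations could attend to both a coarse global image feature and fine-grained object-level features, with a gate that can down-weight misleading or absent visual evidence. This is a minimal illustrative sketch, not the authors' released implementation; all module and variable names (e.g. MultiGranularityFusion, global_feat, object_feats) are hypothetical, and the dimensions are arbitrary.

```python
import torch
import torch.nn as nn

class MultiGranularityFusion(nn.Module):
    """Hypothetical sketch: fuse word features with global and object-level
    image features via cross-modal attention, then gate the visual signal."""

    def __init__(self, dim: int = 256):
        super().__init__()
        # Cross-modal attention: words (queries) attend to visual features (keys/values).
        self.coarse_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.fine_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        # Per-word gate controls how much visual guidance is kept
        # (useful when objects are missing or misleading).
        self.gate = nn.Sequential(nn.Linear(3 * dim, dim), nn.Sigmoid())
        self.out = nn.Linear(2 * dim, dim)

    def forward(self, words, global_feat, object_feats):
        # words:        (batch, seq_len, dim)  text token representations
        # global_feat:  (batch, 1, dim)        coarse image-level feature
        # object_feats: (batch, n_obj, dim)    fine-grained object/region features
        coarse, _ = self.coarse_attn(words, global_feat, global_feat)
        fine, _ = self.fine_attn(words, object_feats, object_feats)
        visual = self.out(torch.cat([coarse, fine], dim=-1))
        g = self.gate(torch.cat([words, coarse, fine], dim=-1))
        # Gated residual fusion: each word keeps only as much visual guidance as helps.
        return words + g * visual

# Toy usage with random tensors standing in for encoder outputs.
fusion = MultiGranularityFusion(dim=256)
words = torch.randn(2, 12, 256)        # e.g. projected BERT token features
global_feat = torch.randn(2, 1, 256)   # e.g. pooled feature of the whole image
object_feats = torch.randn(2, 5, 256)  # e.g. detected object/region features
fused = fusion(words, global_feat, object_feats)
print(fused.shape)  # torch.Size([2, 12, 256])
```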
