Abstract
Named entity recognition in the Indonesian language has developed significantly in recent years. However, it still lacks standardized, publicly available corpora; a small dataset exists but suffers from inconsistent annotations. We therefore re-annotated the dataset to improve its consistency and benefit the community. Our re-annotation led to better training results with an effective baseline model consisting of a bidirectional long short-term memory network and a conditional random field. To make the most of the limited available data, we incorporated better contextualization and transferred external knowledge by exploiting monolingual and multilingual pre-trained language models, such as IndoBERT and XLM-RoBERTa. Beyond the general improvement from the language models, we observed that the monolingual model is more sensitive, while the multilingual ones show advantages in rich morphological knowledge. We also applied cross-lingual transfer learning to utilize high-resource corpora in other languages. We adopted English, Spanish, Dutch, and German as source languages for the target Indonesian language and found that Dutch plays a special role in the data transfer method, owing to morphological similarity attributable to historical contact between the two languages.
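As a minimal illustration of the CRF decoding step in the BiLSTM-CRF baseline mentioned above, the sketch below runs Viterbi decoding over per-token emission scores (which a BiLSTM would produce) and tag-transition scores. The tag set, scores, and function name are hypothetical, not taken from the paper.

```python
def viterbi_decode(emissions, transitions, tags):
    """Return the highest-scoring tag sequence under a linear-chain CRF.

    emissions:   list of {tag: score} dicts, one per token
                 (in a BiLSTM-CRF these come from the BiLSTM layer)
    transitions: {(prev_tag, tag): score} pairwise compatibility scores
    tags:        list of all tag labels
    """
    n = len(emissions)
    # Best cumulative score for each tag at the first position.
    score = {t: emissions[0][t] for t in tags}
    backpointers = []
    for i in range(1, n):
        new_score, ptr = {}, {}
        for t in tags:
            # Pick the predecessor tag maximizing score + transition.
            best_prev = max(tags, key=lambda p: score[p] + transitions[(p, t)])
            new_score[t] = (score[best_prev]
                            + transitions[(best_prev, t)]
                            + emissions[i][t])
            ptr[t] = best_prev
        score = new_score
        backpointers.append(ptr)
    # Backtrack from the best final tag to recover the full path.
    last = max(tags, key=lambda t: score[t])
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return list(reversed(path))


if __name__ == "__main__":
    # Toy example: a transition score forbidding O -> I enforces that an
    # inside tag must follow a beginning tag, as in BIO-style NER.
    tags = ["B", "I", "O"]
    trans = {(p, t): (-100.0 if (p, t) == ("O", "I") else 0.0)
             for p in tags for t in tags}
    em = [{"B": 5.0, "I": 0.0, "O": 0.0},
          {"B": 0.0, "I": 5.0, "O": 0.0},
          {"B": 0.0, "I": 0.0, "O": 5.0}]
    print(viterbi_decode(em, trans, tags))  # ['B', 'I', 'O']
```

The CRF layer's value over per-token classification is exactly this joint decoding: invalid tag sequences (such as an `I` tag with no preceding `B`) can be penalized globally rather than per token.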
Published in: ACM Transactions on Asian and Low-Resource Language Information Processing