Named entity recognition for extracting concept in ontology building on Indonesian language using end-to-end bidirectional long short term memory

Joan Santoso, Esther Irawati Setiawan, Christian Nathaniel Purwanto, Eko Mulyanto Yuniarno, Mochamad Hariadi, Mauridhi Hery Purnomo

doi:10.1016/j.eswa.2021.114856

Abstract

Information Extraction is widely used to obtain structured information from unstructured text collections. Named Entity Recognition (NER), one of its primary tasks, extracts entities such as persons, locations, and organizations. NER is also part of ontology building, which is the main objective of this research. An ontology is built from a collection of concepts and the relations between those concepts. Concepts in an ontology usually consist of groups of entities and are obtained using Noun Phrase Extraction or Named Entity Recognition. Our main focus in this research is to extract concepts for ontology building automatically using NER. We chose NER because previous Noun Phrase Extraction work produced weak results: not all of the nouns obtained are entities. Our proposed methodology applies an end-to-end model based on Bidirectional Long Short-Term Memory (Bi-LSTM), which performs sequence classification by taking the context of the input into account. NER approaches in previous studies run Part-of-Speech (POS) tagging in the preprocessing phase using separate tools or models, and then use the POS tags as a feature to improve NER performance. Our proposed methodology instead provides an end-to-end system that performs both POS tagging and NER, so no additional tool is needed for POS tagging. This is the advantage of our model compared to other models. Experiments were conducted on news documents labeled with four entity classes and 35 part-of-speech types. The target entities extracted in this study are person, location, organization, and miscellaneous. We evaluated the performance of our model using the F1-score. The best F1-scores achieved were 91.79% for POS tagging and 83.18% for NER.
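To make the evaluation metric concrete, the following is a minimal sketch of how an F1-score can be computed for NER output. It rests on assumptions that the abstract does not state: tags follow a BIO scheme (`B-PER`, `I-PER`, `O`, and so on), and a predicted entity counts as correct only when its span and type both match the gold annotation exactly. The function names `extract_spans` and `f1_score` and the example tags are hypothetical and are not taken from the paper.

```python
def extract_spans(tags):
    """Return the set of (start, end, type) entity spans in a BIO tag sequence."""
    spans = set()
    start, kind = None, None
    # A trailing "O" sentinel closes any span that runs to the end of the sentence.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and kind == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, kind))
            start, kind = None, None
        if tag.startswith("B-"):
            start, kind = i, tag[2:]
    return spans


def f1_score(gold, pred):
    """Entity-level F1: harmonic mean of precision and recall over exact spans."""
    gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
    true_positives = len(gold_spans & pred_spans)
    precision = true_positives / len(pred_spans) if pred_spans else 0.0
    recall = true_positives / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical five-token sentence: the model finds the person correctly
# but labels the location as an organization.
gold = ["B-PER", "I-PER", "O", "B-LOC", "O"]
pred = ["B-PER", "I-PER", "O", "B-ORG", "O"]
print(f1_score(gold, pred))  # 0.5
```

In this example one of two predicted entities is correct and one of two gold entities is found, so precision and recall are both 0.5 and the F1-score is 0.5. A token-level or per-class variant would give different numbers; which variant the reported scores use is not specified in the abstract.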

Full Text