Abstract

Neither traditional rule-based named entity recognition (NER) nor the latest language models perform well at information extraction from noisy text, that is, text containing linguistic errors, slang, loanwords, and jargon. Building defect complaints filed by residents via online systems are a representative example of such noisy text. This paper proposes an NER method that automatically extracts defect information from noisy text using a defect thesaurus and transfer learning. The thesaurus built herein comprises 1097 defect named entities in 23 categories. NER performance was tested on 69,750 defect complaints through transfer learning of three representative pre-trained language models: Multilingual Bidirectional Encoder Representations from Transformers (BERT), Korean BERT (KoBERT), and Korean Efficiently Learning an Encoder that Classifies Token Replacements Accurately (KoELECTRA). The proposed method achieved an average F1 score of 91.0% with KoBERT, exceeding the open benchmark NER performance reported for clean text (86.1%).
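
The following is a minimal sketch of the transfer-learning setup the abstract describes: a pre-trained language model is adapted to token classification so that each token in a complaint sentence receives a defect entity tag. The multilingual BERT checkpoint and the BIO label set below are illustrative placeholders, not the paper's actual KoBERT configuration or its 23 thesaurus categories.

```python
# Sketch: adapting a pre-trained language model to NER (token classification).
# Assumptions: "bert-base-multilingual-cased" stands in for the Korean models
# used in the paper; the labels below are hypothetical defect categories.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical BIO tagging scheme covering two example defect categories.
labels = ["O", "B-CRACK", "I-CRACK", "B-LEAK", "I-LEAK"]
id2label = dict(enumerate(labels))
label2id = {label: i for i, label in enumerate(labels)}

checkpoint = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(
    checkpoint, num_labels=len(labels), id2label=id2label, label2id=label2id
)

# Tag a (translated) defect complaint. Before fine-tuning on annotated
# complaints the predictions are meaningless, but the pipeline is identical.
text = "Water is leaking from a crack in the bathroom ceiling."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
pred_ids = logits.argmax(dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred in zip(tokens, pred_ids):
    print(f"{token}\t{id2label[pred]}")
```

In practice the classification head would be fine-tuned on complaints annotated with the defect thesaurus, and performance would be measured with an entity-level F1 score as in the paper.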
