Abstract

Neither traditional rule-based named entity recognition (NER) nor the latest language models perform well at information extraction from noisy text, that is, text containing linguistic errors, slang, loanwords, and jargon. Building defect complaints filed by residents via online systems are a representative example of such noisy text. This paper proposes an NER method that automatically extracts defect information from noisy text using a defect thesaurus and transfer learning. The thesaurus built herein comprises 1097 defect named entities in 23 categories. NER performance was tested on 69,750 defect complaints through transfer learning of three representative pre-trained language models: Multilingual Bidirectional Encoder Representations from Transformers (BERT), Korean BERT (KoBERT), and Korean Efficiently Learning an Encoder that Classifies Token Replacements Accurately (KoELECTRA). The proposed method achieved an average F1 score of 91.0% with KoBERT, exceeding the open benchmark NER performance reported for clean text (86.1%).
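
The following is a minimal sketch of the transfer-learning setup the abstract describes: a pre-trained language model is adapted to token classification so that each token in a complaint sentence receives a defect entity tag. The multilingual BERT checkpoint and the BIO label set below are illustrative placeholders, not the paper's actual KoBERT configuration or its 23 thesaurus categories.

```python
# Sketch: adapting a pre-trained language model to NER (token classification).
# Assumptions: "bert-base-multilingual-cased" stands in for the Korean models
# used in the paper; the labels below are hypothetical defect categories.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical BIO tagging scheme covering two example defect categories.
labels = ["O", "B-CRACK", "I-CRACK", "B-LEAK", "I-LEAK"]
id2label = dict(enumerate(labels))
label2id = {label: i for i, label in enumerate(labels)}

checkpoint = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(
    checkpoint, num_labels=len(labels), id2label=id2label, label2id=label2id
)

# Tag a (translated) defect complaint. Before fine-tuning on annotated
# complaints the predictions are meaningless, but the pipeline is identical.
text = "Water is leaking from a crack in the bathroom ceiling."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
pred_ids = logits.argmax(dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred in zip(tokens, pred_ids):
    print(f"{token}\t{id2label[pred]}")
```

In practice the classification head would be fine-tuned on complaints annotated with the defect thesaurus, and performance would be measured with an entity-level F1 score as in the paper.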
