The Development of a Named Entity Recognizer for Detecting Personal Information Using a Korean Pretrained Language Model

Sungsoon Jang, Yeseul Cho, Hyeonmin Seong, Taejong Kim, Hosung Woo

doi:10.3390/app14135682

Abstract

Social network services and chatbots facilitate language learning without constraints of time or space, but they are susceptible to personal information leakage. Accurate detection of personal information is essential to preventing such leaks. Conventional named entity recognizers commonly used for this purpose often fail, either by missing entities (unrecognition) or by labeling them incorrectly (misrecognition). Research in named entity recognition (NER) predominantly focuses on English, which poses challenges for non-English languages. By specifying procedures for the development of Korean-based tag sets, data collection, and preprocessing, we provide guidance on applying NER research to non-English languages. Such research could significantly benefit artificial intelligence (AI)-based natural language processing globally. We developed a personal information tag set comprising 33 items and established guidelines for dataset creation, then converted the resulting dataset into JSON format for AI learning. Two state-of-the-art AI models, BERT and ELECTRA, were employed to implement and evaluate the NER model, which achieved an F1-score of 0.943 and outperformed conventional recognizers in detecting personal information. This result suggests that the proposed NER model can effectively prevent personal information leakage in systems that process interactive text data, marking a significant step toward safeguarding privacy across digital platforms.
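To make the reported metric concrete, the following is a minimal sketch of how an entity-level F1-score can be computed for an NER model. The F1-score is the harmonic mean of precision and recall. The abstract does not state how the score was computed, so the BIO tagging scheme, the exact-match criterion, and the label names `NAME` and `PHONE` are all assumptions made for illustration; they are not taken from the 33-item tag set.

```python
from typing import List, Set, Tuple

# (start index, end index exclusive, label); hypothetical representation
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence (assumed scheme)."""
    spans: Set[Span] = set()
    start, label = None, None
    # A trailing "O" sentinel closes any entity that runs to the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.add((start, i, label))
        start, label = None, None
        # A stray "I-" without a preceding "B-" is treated as a new entity.
        if tag.startswith("B-") or tag.startswith("I-"):
            start, label = i, tag[2:]
    return spans


def entity_f1(gold: List[str], pred: List[str]) -> float:
    """Exact-match entity-level F1 between gold and predicted tags."""
    gold_spans, pred_spans = bio_to_spans(gold), bio_to_spans(pred)
    true_positives = len(gold_spans & pred_spans)
    precision = true_positives / len(pred_spans) if pred_spans else 0.0
    recall = true_positives / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: the model finds the name but misses the phone number.
gold = ["B-NAME", "I-NAME", "O", "B-PHONE"]
pred = ["B-NAME", "I-NAME", "O", "O"]
score = entity_f1(gold, pred)
```

In this toy case the missed phone number is an unrecognition error: precision stays at 1.0 while recall drops to 0.5, so the F1-score is 2/3. A misrecognition, such as tagging the phone number with the wrong label, would lower both precision and recall.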

Full Text