Abstract

Tokenization is a crucial first step in training a Pre-trained Language Model (PLM), as it alleviates the challenging Out-of-Vocabulary (OOV) problem in Natural Language Processing. Because the tokenization strategy shapes how a model captures linguistic information, the composition of input features must be chosen with the characteristics of the target language in mind for good model performance. This study answers the question “Which tokenization strategy best suits the characteristics of the Korean language for the Named Entity Recognition (NER) task with a language model?”, focusing on tokenization as the step that most directly determines the quality of input features. We first present two major challenges that the agglutinative nature of Korean poses for the NER task. We then analyze, both quantitatively and qualitatively, how each tokenization strategy copes with these challenges. By adopting various linguistic segmentation units, namely morpheme, syllable, and subcharacter, we demonstrate their effectiveness and compare the performance of PLMs built on each tokenization strategy. We validate that the most consistent strategy across the challenges of the Korean language is syllable-level tokenization based on SentencePiece.
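To make the segmentation units concrete, the sketch below (ours, not the paper's code) contrasts syllable-level and subcharacter-level segmentation of a Korean word using only the Python standard library; morpheme-level segmentation would additionally require a morphological analyzer such as MeCab-ko, which we only note here.

```python
import unicodedata

word = "먹었다"  # "ate" = stem 먹- + past-tense suffix -었- + ending -다

# Syllable-level segmentation: each precomposed Hangul syllable block
# is one token.
syllables = list(word)
print(syllables)  # ['먹', '었', '다']

# Subcharacter (jamo) segmentation: Unicode NFD decomposes each
# syllable block into its conjoining jamo (initial, medial, final).
jamo = list(unicodedata.normalize("NFD", word))
print(jamo)  # ['ᄆ', 'ᅥ', 'ᆨ', 'ᄋ', 'ᅥ', 'ᆻ', 'ᄃ', 'ᅡ']

# Morpheme-level segmentation (먹 / 었 / 다) requires an external
# morphological analyzer (e.g., MeCab-ko) and is only indicated here.
```

Syllable tokens keep each Hangul block intact, while NFD normalization exposes the jamo beneath each block, which is exactly the extra granularity the subcharacter strategy trades against longer input sequences.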

Highlights

  • Tokenization, the process of segmenting text into sub-unit tokens, is an essential and fundamental step for Natural Language Processing (NLP) tasks

  • Experiments are conducted on four Korean NER corpora: the NIKL Named Entity Recognition (NER) Corpus, distributed by the National Institute of Korean Language (NIKL), the institution that establishes norms for the Korean language; the AIR & Naver NER Challenge corpus, released at the Korean Natural Language Processing Competition held by Naver and Changwon University; the KMOU NER corpus, distributed by Korea Maritime University (KMOU); and KLUE, which stands for Korean Language Understanding Evaluation and was recently released to evaluate the ability of Korean models to understand natural language

  • This study answers the question “Which tokenization strategy is optimal for the Korean NER task?” through two detailed analyses, focusing on tokenization with various segmentation schemes applied to the Korean language



Introduction

Tokenization, the process of segmenting text into sub-unit tokens, is an essential and fundamental step for Natural Language Processing (NLP) tasks. Recent subword tokenization is a powerful method to alleviate the challenging Out-of-Vocabulary (OOV) problem [31], and algorithms such as Byte-Pair Encoding (BPE) [43], WordPiece [49], and SentencePiece [20] belong to this category. Tokenization using these methods is more robust against the OOV problem than tokenization based on lexical conventions, allowing the model to better capture the semantic and syntactic meaning of words in context by decomposing words into smaller token units. Unlike an inflectional language, which tends to fuse an inflectional morpheme with the root to express syntactic or semantic features, an agglutinative language such as Korean builds words from multiple morphemes that jointly determine their meaning, and all of these morphemes tend to remain unchanged after they are joined.
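As a concrete illustration of how subword tokenization alleviates the OOV problem, the following is a minimal SentencePiece sketch; the corpus file, vocabulary size, and model type are illustrative assumptions, not the configuration used in this study.

```python
import sentencepiece as spm

# Train a small subword model. 'corpus.txt' (one raw sentence per
# line), the vocabulary size, and the BPE model type are assumptions
# for illustration only.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="ko_bpe",
    vocab_size=8000,
    model_type="bpe",
    character_coverage=0.9995,  # a common setting for CJK text
)

sp = spm.SentencePieceProcessor(model_file="ko_bpe.model")

# A word never seen verbatim during training is decomposed into
# in-vocabulary subword pieces instead of mapping to <unk>, which is
# how subword tokenization sidesteps the OOV problem.
print(sp.encode("자연어 처리는 재미있다", out_type=str))
```

As long as the input characters are covered by the learned vocabulary, no token is forced to the unknown symbol, unlike tokenization against a fixed lexical dictionary.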
