Automatic Word Spacing of Korean Using Syllable and Morpheme

Jeong-Myeong Choi, Chan-Young Park, Yu-Seop Kim, Jong-Dae Kim

doi:10.3390/app11020626

Abstract

In Korean, spacing is very important for the readability of sentences and for understanding their context. In natural language processing for Korean, a sentence with incorrect spacing has a changed sentence structure, which affects performance. Previous studies corrected spacing errors using n-gram-based statistical methods and morphological analyzers, and recently many studies using deep learning have been conducted. In this study, we address the spacing error correction problem using both syllable-level and morpheme-level information. The proposed model combines a convolutional neural network layer, which can learn syllable and morphological pattern information in sentences, with a bidirectional long short-term memory layer, which can learn forward and backward sequence information. To evaluate the proposed model, accuracy was measured at the syllable level, and precision, recall, and F1 score were measured at the word level. The experiments confirmed that performance improved over the previous study.

Highlights

This study defines Korean word spacing correction as a sequence labeling problem in which a spacing tag is attached to each syllable of a sentence in order
This study proposes using both syllable-level and morpheme-level information of Korean
The model combines multiple-filter 1D convolutional neural networks (CNN) with bidirectional long short-term memory (Bi-LSTM), and syllable-level and morpheme-level information is merged in the second half of the model
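
The sequence labeling formulation in the first highlight can be illustrated with a minimal sketch. It assumes a binary tag scheme (1 = a space follows the syllable, 0 = no space follows) and precomposed Hangul, where one character is one syllable; the function names `to_syllables_and_tags` and `apply_tags` are hypothetical, and the paper's actual tag set may differ.

```python
def to_syllables_and_tags(sentence):
    """Turn a correctly spaced sentence into (syllables, tags).

    Assumed tag scheme (not confirmed by the paper):
    1 = a space follows this syllable, 0 = no space follows.
    """
    syllables, tags = [], []
    for word in sentence.split():
        for i, ch in enumerate(word):
            syllables.append(ch)
            # The last syllable of each word is followed by a space.
            tags.append(1 if i == len(word) - 1 else 0)
    if tags:
        tags[-1] = 0  # nothing follows the final syllable of the sentence
    return syllables, tags


def apply_tags(syllables, tags):
    """Rebuild a spaced sentence from syllables and (predicted) tags."""
    pieces = []
    for ch, tag in zip(syllables, tags):
        pieces.append(ch)
        if tag == 1:
            pieces.append(" ")
    return "".join(pieces).strip()


syllables, tags = to_syllables_and_tags("나는 학교에 간다")
restored = apply_tags(syllables, tags)
```

In this framing a model only has to predict one tag per syllable of the unspaced input, and `apply_tags` turns those predictions back into a spaced sentence.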

Summary

Introduction

Word spacing marks the boundary between the words that make up a sentence. Text data with spacing errors can degrade performance in various natural language processing (NLP) tasks. In one earlier approach, morpheme-level data were converted to POS tags at the syllable level, and syllable and noun-unit n-grams and a POS distribution vector were used as additional features. Most previous studies construct the word spacing system using only one of the features of syllables, words, and morphemes in sentences. The proposed model, which combines CNN and Bi-LSTM, extracts local features of syllables and morphemes. Most word spacing correction studies use Sejong corpus data. The collected Sejong corpus and news articles contain HTML tags, special characters, etc., which are not needed for word spacing processing. The number of sentences used in this study is 13 million.
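
The cleanup step mentioned above (removing HTML tags and special characters) could look roughly like the following sketch. The set of characters kept here (Hangul syllables, ASCII letters and digits, whitespace, and a few punctuation marks) is an assumption for illustration, and `clean_sentence` is a hypothetical helper; the paper does not state its exact rules.

```python
import re

# Matches anything that looks like an HTML tag, e.g. <p> or </b>.
TAG_RE = re.compile(r"<[^>]+>")
# Assumed whitelist: Hangul syllables (U+AC00..U+D7A3), ASCII letters
# and digits, whitespace, and basic punctuation. Everything else is dropped.
DISALLOWED_RE = re.compile(r"[^\uAC00-\uD7A3A-Za-z0-9\s.,?!]")
WHITESPACE_RE = re.compile(r"\s+")


def clean_sentence(text):
    """Remove HTML tags and special characters, then normalize whitespace."""
    # Tags are removed without inserting a space, so inline markup such as
    # <b>...</b> does not introduce spacing that was not in the text.
    text = TAG_RE.sub("", text)
    text = DISALLOWED_RE.sub("", text)
    return WHITESPACE_RE.sub(" ", text).strip()


cleaned = clean_sentence("<p>나는 <b>학교</b>에 간다.★</p>")
```

Removing tags without adding a space matters for this task, because the spaces that remain in the cleaned text become the gold labels.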

Word Spacing Correction Model

Integer Encoding

Embedding

Multiple Filter 1-dimensional Convolutional Neural Networks

Bidirectional Long Short-Term Memory

Labeling

Parameters

Metric
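
The abstract states that accuracy is measured at the syllable level and precision, recall, and F1 score at the word level. A minimal sketch of such metrics follows, assuming binary per-syllable tags (1 = a space follows the syllable) and assuming that a predicted word counts as correct only when both of its boundaries match the gold word; the paper's exact definitions may differ, and all function names are hypothetical.

```python
def word_spans(tags):
    """Return the set of (start, end) syllable index spans implied by the tags."""
    spans, start = set(), 0
    for i, tag in enumerate(tags):
        # A word ends where a space follows, or at the end of the sentence.
        if tag == 1 or i == len(tags) - 1:
            spans.add((start, i))
            start = i + 1
    return spans


def syllable_accuracy(gold, pred):
    """Fraction of syllables whose spacing tag was predicted correctly."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def word_prf(gold, pred):
    """Word-level precision, recall, and F1 based on exact span matches."""
    gold_spans, pred_spans = word_spans(gold), word_spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = [0, 1, 0, 0, 1, 0, 0]  # "나는 학교에 간다"
pred = [0, 1, 0, 1, 0, 0, 0]  # "나는 학교 에간다"
accuracy = syllable_accuracy(gold, pred)
precision, recall, f1 = word_prf(gold, pred)
```

In this example two misplaced tags still leave most syllables correct, but only one of the three words is recovered, which shows why the word-level scores are the stricter of the two views.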

Evaluation and Result

Conclusions

Journal: Applied Sciences	Publication Date: Jan 11, 2021
Citations: 3	License type: CC BY 4.0


