Abstract
In Korean, word spacing strongly affects the readability of a sentence and how its context is understood. In natural language processing for Korean, a sentence with incorrect spacing has a changed structure, which degrades performance. Earlier studies corrected spacing errors with n-gram-based statistical methods and morphological analyzers, and many recent studies use deep learning. In this study, we address the spacing error correction problem using both syllable-level and morpheme-level information. The proposed model combines a convolutional neural network layer, which learns syllable and morphological pattern information in sentences, with a bidirectional long short-term memory layer, which learns forward and backward sequence information. We evaluated the proposed model by accuracy at the syllable level and by precision, recall, and F1 score at the word level. The experiments confirmed that performance improved over the previous study.
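The two evaluation levels mentioned in the abstract can be illustrated with a minimal sketch. It assumes a binary tag per syllable (1 if a space follows the syllable, 0 otherwise) and counts a predicted word as correct only when both of its boundaries match a gold word. The tag scheme, the matching rule, and the function names `word_spans` and `evaluate` are illustrative assumptions, not details taken from the paper.

```python
def word_spans(tags):
    """Return the set of (start, end) syllable-index spans implied by the tags.

    Assumed scheme: tag 1 means a space follows the syllable, tag 0 means none.
    """
    spans, start = set(), 0
    for i, tag in enumerate(tags):
        if tag == 1 or i == len(tags) - 1:
            spans.add((start, i))
            start = i + 1
    return spans


def evaluate(gold, pred):
    """Syllable-level accuracy plus word-level precision, recall, and F1."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and of equal length")
    # Syllable level: fraction of syllables whose spacing tag is correct.
    accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    # Word level: a predicted word counts only if its exact span is in the gold set.
    gold_spans, pred_spans = word_spans(gold), word_spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return accuracy, precision, recall, f1


gold = [0, 1, 0, 0, 1, 0, 0]  # "나는 학교에 간다"
pred = [0, 1, 0, 1, 1, 0, 0]  # one extra space predicted inside "학교에"
accuracy, precision, recall, f1 = evaluate(gold, pred)
```

In this example a single wrong tag costs one syllable out of seven, but it breaks one gold word into two incorrect predicted words, which is why word-level scores are usually stricter than syllable-level accuracy.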
Highlights
This study defines Korean word spacing correction as a sequence labeling problem that attaches a spacing tag to each syllable of a sentence in order
This study proposes using both the syllable level and the morpheme level of Korean
The model combines multi-filter 1D convolutional neural networks (CNN) with bidirectional long short-term memory (Bi-LSTM), and syllable-level and morpheme-level information is combined in the second half of the model
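The sequence labeling formulation in the highlights can be sketched as follows. The sketch assumes a binary tag per syllable (1 if a space follows, 0 otherwise) and text in which each Hangul syllable is a single precomposed character. The tag set and the helper names `sentence_to_labels` and `labels_to_sentence` are hypothetical and are not taken from the paper.

```python
def sentence_to_labels(sentence):
    """Split a correctly spaced sentence into syllables and per-syllable tags.

    Assumed scheme: tag 1 means a space follows the syllable, tag 0 means none.
    """
    syllables, tags = [], []
    for word in sentence.split():
        for i, ch in enumerate(word):
            syllables.append(ch)
            tags.append(1 if i == len(word) - 1 else 0)
    if tags:
        tags[-1] = 0  # no space follows the final syllable of the sentence
    return syllables, tags


def labels_to_sentence(syllables, tags):
    """Rebuild a spaced sentence from syllables and (predicted) spacing tags."""
    pieces = []
    for ch, tag in zip(syllables, tags):
        pieces.append(ch)
        if tag == 1:
            pieces.append(" ")
    return "".join(pieces).strip()


syllables, tags = sentence_to_labels("나는 학교에 간다")
# syllables: ['나', '는', '학', '교', '에', '간', '다']
# tags:      [0, 1, 0, 0, 1, 0, 0]
restored = labels_to_sentence(syllables, tags)
```

Under this framing, a correction model only has to predict one tag per input syllable, and the corrected sentence is recovered by inserting spaces wherever the predicted tag is 1.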
Summary
Word spacing is the boundary between the words that make up a sentence. Text data with spacing errors can degrade performance in various natural language processing (NLP) tasks. The data composed at the morpheme level was converted to POS tags at the syllable level, and the data were composed using syllable and noun-unit n-grams and a POS distribution vector as additional features. Most previous studies construct the word spacing system using one of the features of syllables, words, and morphemes in sentences. We extracted local features of syllables and morphemes; the model combines CNN and Bi-LSTM. Most of the word spacing correction studies use Sejong corpus data. The collected Sejong corpus and news articles contain HTML tags, special characters, etc., which are not necessary for processing word spacing. The number of sentences used in this study is 13 million.
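A cleaning step of the kind implied by the remark about HTML tags and special characters might look like the sketch below. The character whitelist (Hangul syllables, ASCII letters and digits, whitespace, and a few punctuation marks) and the function name `clean_sentence` are assumptions made for illustration. The summary does not specify which characters the authors kept or removed.

```python
import html
import re

# Anything that looks like an HTML tag, e.g. <p> or </div>.
TAG_RE = re.compile(r"<[^>]+>")
# Assumed whitelist: Hangul syllables, ASCII letters and digits,
# whitespace, and basic sentence punctuation. Everything else is dropped.
DISALLOWED_RE = re.compile(r"[^\uAC00-\uD7A3A-Za-z0-9\s.,?!]")
SPACE_RE = re.compile(r"\s+")


def clean_sentence(text):
    """Strip HTML tags, HTML entities, and special characters from raw text."""
    text = TAG_RE.sub(" ", text)         # remove tags, keep a separator
    text = html.unescape(text)           # turn entities such as &amp; into characters
    text = DISALLOWED_RE.sub("", text)   # drop characters outside the whitelist
    return SPACE_RE.sub(" ", text).strip()


cleaned = clean_sentence("<p>나는 학교에 ★간다.</p>")
```

Replacing tags with a space rather than an empty string avoids gluing together words that were separated only by markup. The existing spacing of the text itself is left untouched, since it serves as the gold label.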