Abstract

South and North Korea both use the Korean language. However, Korean natural language processing (NLP) research has focused mostly on the South Korean variety. As a result, existing Korean NLP systems, such as neural machine translation (NMT) systems, cannot properly process North Korean inputs. Training a model on North Korean data is the most straightforward solution, but the available data are insufficient for training NMT models. To address this, we constructed a parallel corpus from a comparable corpus in order to develop a North Korean NMT model. We manually aligned parallel sentences to create evaluation data and automatically aligned the remaining sentences to create training data. We trained a North Korean NMT model on our North Korean parallel data and improved North Korean translation quality using South Korean resources such as parallel data and a pre-trained model. In addition, we propose Korean-specific pre-processing methods, namely character tokenization and phoneme decomposition, to use the South Korean resources more efficiently. We demonstrate that phoneme decomposition consistently improves North Korean translation accuracy compared with the other pre-processing methods.
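
To make the two pre-processing methods concrete, the following is a minimal sketch, not the authors' implementation. It assumes that "character tokenization" means splitting text into individual Hangul syllable blocks and that "phoneme decomposition" means splitting each precomposed syllable into its conjoining jamo using the standard Unicode arithmetic. The function names `character_tokenize`, `decompose_syllable`, and `decompose_phonemes` are hypothetical.

```python
# Illustrative sketch only; helper names are hypothetical, not from the paper.
# Precomposed Hangul syllables occupy U+AC00..U+D7A3 and are laid out as
# (initial * 21 + medial) * 28 + final, where final == 0 means "no final".
HANGUL_BASE = 0xAC00
HANGUL_LAST = 0xD7A3
NUM_MEDIALS = 21
NUM_FINALS = 28

INITIAL_BASE = 0x1100  # first conjoining initial consonant
MEDIAL_BASE = 0x1161   # first conjoining vowel
FINAL_BASE = 0x11A7    # FINAL_BASE + 1 is the first conjoining final consonant


def character_tokenize(text):
    """Split text into single characters (one token per syllable block)."""
    return list(text)


def decompose_syllable(ch):
    """Return the conjoining jamo for one Hangul syllable, else ch unchanged."""
    code = ord(ch)
    if not (HANGUL_BASE <= code <= HANGUL_LAST):
        return ch
    offset = code - HANGUL_BASE
    initial, rest = divmod(offset, NUM_MEDIALS * NUM_FINALS)
    medial, final = divmod(rest, NUM_FINALS)
    jamo = chr(INITIAL_BASE + initial) + chr(MEDIAL_BASE + medial)
    if final:  # syllables without a final consonant yield only two jamo
        jamo += chr(FINAL_BASE + final)
    return jamo


def decompose_phonemes(text):
    """Decompose every Hangul syllable in text into its jamo sequence."""
    return "".join(decompose_syllable(ch) for ch in text)
```

Under these assumptions, a spelling difference between the two varieties that changes only one consonant or vowel alters a single jamo rather than a whole syllable token, which is one plausible reason a finer-grained representation could help share South Korean resources.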
