Abstract

Machine translation (MT) has recently attracted much research on advanced techniques (i.e., statistical-based and deep learning-based approaches) and has achieved great results for popular languages. However, research involving low-resource languages such as Korean often suffers from a lack of openly available bilingual language resources. In this research, we built open, extensive parallel corpora for training MT models, named the Ulsan parallel corpora (UPC). Currently, UPC contains two parallel corpora: a Korean-English dataset and a Korean-Vietnamese dataset. The Korean-English dataset has over 969 thousand sentence pairs, and the Korean-Vietnamese parallel corpus consists of over 412 thousand sentence pairs. Furthermore, the high rate of homographs in Korean causes word-ambiguity issues in MT. To address this problem, we developed a powerful word-sense annotation system, named UTagger, based on a combination of sub-word conditional probability and knowledge-based methods. We applied UTagger to UPC and used these corpora to train both statistical and neural (deep learning-based) MT systems. The experimental results demonstrate that high-quality MT systems, in terms of Bilingual Evaluation Understudy (BLEU) and Translation Error Rate (TER) scores, can be built using UPC. Both UPC and UTagger are freely available for download and use.
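
For concreteness, a minimal sketch of the corpus-level BLEU and TER scoring mentioned above is given below, using the sacrebleu Python library. The hypothesis and reference sentences are illustrative placeholders, not UPC data, and the abstract does not state which scoring toolkit the authors used; this is only one common way to compute these metrics.

    # Minimal sketch: corpus-level BLEU and TER scoring with sacrebleu.
    # The sentences below are illustrative placeholders, not UPC data.
    import sacrebleu

    hypotheses = ["the cat sat on the mat", "he reads a book"]        # MT output
    references = [["the cat is on the mat", "he is reading a book"]]  # one reference set

    bleu = sacrebleu.corpus_bleu(hypotheses, references)
    ter = sacrebleu.corpus_ter(hypotheses, references)

    print(f"BLEU: {bleu.score:.2f}")  # higher is better
    print(f"TER:  {ter.score:.2f}")   # lower is better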

Highlights

  • A machine translation (MT) system that can automatically translate text written in one language into another has been a dream since the beginning of artificial intelligence history

  • The results show that the word-sense disambiguation (WSD) process significantly improves the quality of all MT systems, and that neural MT (NMT) systems give better results than statistical MT (SMT) systems on all parallel corpora

  • This further demonstrates that the Korean-English and Korean-Vietnamese language pairs are consistent with the popular language pairs (i.e., Arabic, Chinese, English, French, German, Japanese, Russian, and Spanish) for which NMT has been reported to be superior to SMT [47,48,49]

Summary

Introduction

An MT system that can automatically translate text written in one language into another has been a dream since the beginning of artificial intelligence history. Chung and Gildea [12] collected Korean-English aligned sentences from websites and obtained approximately 60,000 sentence pairs. These collected parallel corpora are not public, and their sizes are insufficient to train high-quality MT systems. The News Commentary corpus (https://github.com/jungyeul/korean-parallel-corpora) [14], which was crawled from the CNN and Yahoo websites, contains approximately 97,000 Korean-English sentence pairs. These parallel corpora are publicly available, but their sizes are too small to train MT systems. In this work, over 969 thousand Korean-English and over 412 thousand Korean-Vietnamese sentence pairs were obtained. These datasets are large enough to train high-quality MT systems and are available for download at https://github.com/haivv/UPC.
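
A minimal sketch of loading such a sentence-aligned corpus is shown below. The file names (upc.ko, upc.en) are hypothetical, and the actual layout of the https://github.com/haivv/UPC repository may differ; the only assumption is the common convention of one sentence per line, aligned by line number across the two files.

    # Minimal sketch: loading a sentence-aligned parallel corpus.
    # File names are hypothetical; adjust them to the actual repository layout.
    from pathlib import Path

    def load_parallel(src_path: str, tgt_path: str):
        """Return (source, target) sentence pairs, aligned by line number."""
        src_lines = Path(src_path).read_text(encoding="utf-8").splitlines()
        tgt_lines = Path(tgt_path).read_text(encoding="utf-8").splitlines()
        if len(src_lines) != len(tgt_lines):
            raise ValueError("source and target files have different line counts")
        return [(s.strip(), t.strip()) for s, t in zip(src_lines, tgt_lines)]

    pairs = load_parallel("upc.ko", "upc.en")  # hypothetical file names
    print(f"Loaded {len(pairs)} sentence pairs")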

The Parallel Corpora Analysis with UTagger
Utilizing the Pre-Analysis Partial Eojeol Dictionary
Using Sub-Word Conditional Probability
Knowledge-Based Approach for WSD
Korean Morphological Analysis and WSD System
Applying Morphological Analysis and Word-Sense Annotation to UPC
Korean-English Parallel Corpus
Korean-Vietnamese Parallel Corpus
Experimentation
Experimental Results
Conclusions
