Abstract
The translation quality of Neural Machine Translation (NMT) systems depends strongly on the size of the training data. Sufficient amounts of parallel data are, however, not available for many language pairs. This paper presents a corpus augmentation method with two variations: one applicable to all language pairs, and the other specific to the Chinese-Japanese language pair. The method uses both the source and target sentences of an existing parallel corpus and generates multiple pseudo-parallel sentence pairs from a long parallel sentence pair containing punctuation marks as follows: (1) split the sentence pair into parallel partial sentences; (2) back-translate the target partial sentences; and (3) replace each partial sentence in the source sentence with the back-translated target partial sentence to generate pseudo-source sentences. The word alignment information used to determine the split points is refined with "shared Chinese character rates" over segments of the sentence pairs. Experimental results on Japanese-Chinese and Chinese-Japanese translation with ASPEC-JC (Asian Scientific Paper Excerpt Corpus, Japanese-Chinese) show that the method substantially improves translation performance. We also supply the code (see Supplementary Materials) that reproduces our proposed method.
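The three steps above can be sketched in a few lines of Python. This is a simplified, hypothetical illustration, not the authors' released code: `back_translate` stands in for a target-to-source NMT model, and the alignment-based choice of split points is reduced to splitting at punctuation marks.

```python
# Hypothetical sketch of the corpus augmentation steps: (1) split a parallel
# sentence pair at punctuation, (2) back-translate each target partial
# sentence, (3) substitute it into the source to form pseudo-source sentences.

def split_on_punct(sentence, punct=("，", "、", ",")):
    """Split a sentence into partial sentences after each punctuation mark."""
    parts, current = [], []
    for ch in sentence:
        current.append(ch)
        if ch in punct:
            parts.append("".join(current))
            current = []
    if current:
        parts.append("".join(current))
    return parts

def augment_pair(src, tgt, back_translate):
    """Generate pseudo-parallel pairs: each pair keeps the original target
    sentence but replaces one source partial sentence with the
    back-translation of the aligned target partial sentence."""
    src_parts = split_on_punct(src)
    tgt_parts = split_on_punct(tgt)
    if len(src_parts) != len(tgt_parts):
        return []  # skip pairs whose segments do not align one-to-one
    pseudo_pairs = []
    for i in range(len(tgt_parts)):
        pseudo_src = src_parts[:i] + [back_translate(tgt_parts[i])] + src_parts[i + 1:]
        pseudo_pairs.append(("".join(pseudo_src), tgt))
    return pseudo_pairs
```

A two-segment pair thus yields two pseudo-parallel pairs, each sharing the original target side; in the real method the split points come from word alignments (adjusted by shared Chinese character rates for Chinese-Japanese), not raw punctuation alone.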
Highlights
In recent years, Neural Machine Translation (NMT) has made remarkable achievements [1]
Zero-shot translation is a translation mechanism that uses a single NMT engine to translate between multiple languages, even low-resource language pairs for which no direct parallel data were provided during training
We propose a method to augment a parallel corpus by sentence segmentation and synthesis
Summary
Neural Machine Translation (NMT) has made remarkable achievements [1]. Zero-shot translation is a translation mechanism that uses a single NMT engine to translate between multiple languages, even low-resource language pairs for which no direct parallel data were provided during training. Expanding the size of the training data (parallel corpus) is an effective way to improve NMT translation performance for low-resource language pairs. We show that we can improve an NMT system's translation performance by mixing generated pseudo-parallel sentence pairs into the training data, using no monolingual data and without changing the neural network architecture. This makes our approach applicable to different NMT architectures.
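Because the augmentation happens purely at the data level, the training pipeline only changes in how the corpus is assembled. A minimal sketch of this assumed workflow, with `parallel_pairs` and `pseudo_pairs` as lists of (source, target) tuples:

```python
# Minimal sketch (assumed workflow): concatenate the original parallel corpus
# with the generated pseudo-parallel pairs and shuffle before training.
# No monolingual data and no architecture change are required.
import random

def build_training_data(parallel_pairs, pseudo_pairs, seed=0):
    """Mix pseudo-parallel pairs into the training data for any NMT system."""
    data = list(parallel_pairs) + list(pseudo_pairs)
    random.Random(seed).shuffle(data)  # fixed seed for reproducibility
    return data
```

The resulting list can be fed to any standard NMT toolkit unchanged, which is what makes the method architecture-agnostic.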