Abstract

Synthetic training data has been extensively used to train Automatic Post-Editing (APE) models in many recent studies because the quantity of human-created data is considered insufficient. However, the most widely used synthetic APE dataset, eSCAPE, disregards the minimal-editing property of genuine data, and this defect may limit the performance of APE models. This article proposes adapting back-translation to APE so as to constrain edit distance, while using stochastic sampling during decoding to maintain output diversity, yielding a new synthetic APE dataset, RESHAPE. Our experiments show that (1) RESHAPE contains more samples resembling genuine APE data than eSCAPE does, and (2) using RESHAPE as new training data substantially improves APE models' performance over using eSCAPE.

Highlights

  • Machine Translation (MT) has been developed to produce high-quality translations and is being used in various areas

  • In contrast with the phrase-based statistical MT (PBSMT) dataset, the edit-distance distribution of the neural MT (NMT) dataset is drastically skewed (Fig. 2): most samples contain only a small number of errors, so the output probabilities of back-Automatic Post-Editing (back-APE) are likely concentrated on just a few candidates, making the distributions produced by pure sampling and top-k sampling not much different

  • Our quantitative results reveal that Reverse-Edited Synthetic Hypotheses for Automatic Post-Editing (RESHAPE) is a better synthetic dataset than eSCAPE
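The contrast drawn above between pure sampling and top-k sampling can be made concrete with a minimal sketch. The function below is illustrative only (its name and interface are our own, not from the paper): pure sampling draws from the full next-token distribution, while top-k truncates to the k most probable tokens and renormalizes, so on a sharply peaked distribution the two behave almost identically.

```python
import random

def sample_next_token(probs, k=None, rng=None):
    """Sample a token from a next-token distribution.

    probs : dict mapping token -> probability.
    k     : None for pure (ancestral) sampling; an integer to
            truncate to the k most probable tokens (top-k sampling).
    """
    rng = rng or random.Random()
    # Sort tokens by probability, most probable first.
    items = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    if k is not None:
        items = items[:k]  # keep only the k most probable tokens
    # Renormalize implicitly by drawing within the truncated mass.
    total = sum(p for _, p in items)
    r = rng.random() * total
    acc = 0.0
    for tok, p in items:
        acc += p
        if r <= acc:
            return tok
    return items[-1][0]  # guard against floating-point round-off
```

With a peaked distribution such as {"fixed": 0.9, "corrected": 0.07, "repaired": 0.03}, pure sampling and top-k sampling with small k select the same few candidates almost all the time, which is the behavior the highlight describes for back-APE on NMT outputs.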


Summary

INTRODUCTION

Machine Translation (MT) has been developed to produce high-quality translations and is being used in various areas. APE models typically take a source text (src) and its MT output (mt) simultaneously as their inputs and take the post-edited text (pe) as their target. Training those models requires triplet data, called an APE triplet, of the form ⟨src, mt, pe⟩ (Fig. 1). The correction patterns observed in synthetic data may differ from those occurring in genuine data, and this violation results in a significant discrepancy in the edit-distance distribution between eSCAPE and genuine data (Fig. 2), possibly limiting APE performance. To solve this problem, we propose a new synthetic APE data-generation scheme that uses parallel corpora. Experimental results demonstrate that, compared to eSCAPE, not only does our method improve APE performance, but it also produces more samples with characteristics similar to genuine data.
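The triplet construction and the edit-distance constraint described above can be sketched as follows. This is a simplified illustration under our own assumptions, not the paper's implementation: `token_edit_distance` is a standard word-level Levenshtein distance, and `make_triplet` stands in for the generation step by filtering pre-sampled hypotheses so that the synthetic mt differs from pe by only a small number of edits, mimicking the minimal-editing property of genuine data.

```python
def token_edit_distance(a, b):
    """Word-level Levenshtein distance via dynamic programming."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))  # dp[j] = distance for prefixes (i-1, j)
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i  # prev holds dist(i-1, j-1)
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                       # deletion
                        dp[j - 1] + 1,                   # insertion
                        prev + (a[i - 1] != b[j - 1]))   # substitution/match
            prev = cur
    return dp[n]

def make_triplet(src, pe, candidates, max_edits=3):
    """Build a synthetic <src, mt, pe> APE triplet.

    candidates : sampled hypotheses (e.g. from stochastic back-APE
                 decoding); hypothetical stand-in for a real sampler.
    Keeps the first hypothesis within max_edits word edits of pe
    (and not identical to it); returns None if none qualifies.
    """
    for mt in candidates:
        d = token_edit_distance(mt.split(), pe.split())
        if 0 < d <= max_edits:
            return (src, mt, pe)
    return None
```

For example, with pe = "the cat sat down" and candidates ["the cat sat", "a dog ran far away"], only the first hypothesis is within the edit budget, so it becomes the synthetic mt of the triplet.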

PRELIMINARY
MAXIMUM-A-POSTERIORI DECODING
SAMPLING METHOD
RESTRICTED SAMPLING METHOD
EXAMINATION OF BACK-APE DECODING SCHEMES
DISCUSSION
Decoding methods
CONCLUSION
