Abstract

Compared to fully manual translation, post-editing (PE) of machine translation (MT) output can save time and reduce errors. Automatic word-level quality estimation (QE) aims to predict the correctness of words in MT output and holds great promise to aid PE by flagging problematic output. The quality of QE is crucial, as incorrect QE might lead translators to miss errors or to waste time on already correct MT output. Achieving accurate automatic word-level QE is very hard, and it is currently not known (i) at what quality threshold QE actually begins to be useful for human PE, and (ii) how best to present word-level QE information to translators. In particular, should word-level QE visualization indicate the uncertainty of the QE model or not? In this paper, we address both research questions with real and simulated word-level QE, visualizations, and user studies, in which time, subjective ratings, and the quality of the final translations are assessed. Results show that current word-level QE models are not yet good enough to support PE; instead, quality levels of > 80% F1 are required. For helpful quality levels, a visualization reflecting the uncertainty of the QE model is preferred. Our analysis further shows that speed gains achieved through QE are not merely a result of blindly trusting the QE system, but that the quality of the final translations also improves. The threshold results from this paper establish a quality goal for future word-level QE research.
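The "> 80% F1" threshold refers to how closely the QE system's word-level OK/BAD predictions match the gold word labels. As a rough illustration only (not the paper's evaluation code), such predictions can be scored as below; which F1 variant the authors report (F1-BAD, F1-OK, or their product F1-MULT) is an assumption here, and the labels are toy data:

```python
# Illustrative scoring of word-level QE predictions against gold OK/BAD tags.
# The exact F1 variant reported in the paper may differ from what is shown here.
from sklearn.metrics import f1_score

gold = ["OK", "BAD", "OK", "OK", "BAD", "OK"]   # hypothetical gold word labels
pred = ["OK", "BAD", "BAD", "OK", "OK", "OK"]   # hypothetical QE predictions

f1_bad = f1_score(gold, pred, pos_label="BAD")   # F1 on the BAD class
f1_ok = f1_score(gold, pred, pos_label="OK")     # F1 on the OK class
f1_mult = f1_bad * f1_ok                         # product score reported in some WMT word-level QE editions

print(f"F1-BAD={f1_bad:.2f}, F1-OK={f1_ok:.2f}, F1-MULT={f1_mult:.2f}")
```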

Highlights

  • Advances in Machine Translation (MT) have made MT a key component in many professional translation workflows, where human post-editors identify and correct mistakes in raw MT output. Word-level QE holds great potential in supporting PE but is a very difficult machine learning task: even the best models currently only achieve F1 scores in the range of 60% to 63% (Lee, 2020; Specia et al., 2020), depending on the text domain and the underlying …

  • We extend MMPE’s project file structure for quality estimation (QE): the QE models’ quality predictions per word of the MT output, and a value indicating which visualization mode to use for the segment, are stored in and loaded from a JSON file (see the sketch after this list).

  • Of course, this is debatable, but given that we cannot know exactly what a higher-quality QE model would look like, we believe that this simple approach is a reasonable starting point for estimating the threshold at which word-level QE stops confusing …

  • The selection of the text segments follows these steps: …

  • We present our results in 4 categories: (1) subjectively assessed helpfulness per QE quality, (2) …
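The page does not show MMPE's actual file format, so the following is only a hypothetical sketch of what an extended project-file entry could look like; all field names and values (id, mt, qe, visualization) are illustrative assumptions, not MMPE's real schema:

```python
# Hypothetical sketch of a per-segment entry in the extended project file;
# field names are illustrative and do not reflect MMPE's actual schema.
import json

segment = {
    "id": 17,                                    # segment identifier (assumed field)
    "mt": "Das ist ein Beispiel Satz .",         # raw MT output (assumed field)
    "qe": [0.92, 0.88, 0.95, 0.40, 0.35, 0.99],  # per-word quality predictions from the QE model
    "visualization": "uncertainty",              # QE visualization mode for this segment (assumed values)
}

with open("project.json", "w", encoding="utf-8") as f:
    json.dump({"segments": [segment]}, f, ensure_ascii=False, indent=2)
```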


Summary

Introduction

Advances in Machine Translation (MT) have made MT a key component in many professional translation workflows, where human post-editors identify and correct mistakes in raw MT output. Word-level QE holds great potential in supporting PE but is a very difficult machine learning task: even the best models currently only achieve F1 scores in the range of 60% to 63% (Lee, 2020; Specia et al., 2020), depending on the text domain and the underlying …

We experiment with professional translators who are presented with word-level QE output of varying quality, based on state-of-the-art QE systems and simulated QE output, with the goal of determining minimum quality thresholds required to support the … Since visualization of QE output might have an impact on the helpfulness …

… (2019a) won the WMT 2019 word-level QE task (Fonseca et al., 2019) by combining linear, neural, and predictor-estimator systems with new transfer learning approaches using BERT (Devlin et al., 2019) and XLM (Lample and Conneau, 2019) pre-… In the most recent word-level QE shared task of 2020 (Specia et al., 2020), Lee (2020) proposed an XLM-…
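The excerpt mentions simulated QE output of varying quality used to locate the helpfulness threshold. The paper's actual simulation procedure is not described on this page, so the sketch below shows just one plausible way to generate word-level OK/BAD tags at an approximate target F1 by randomly corrupting gold labels; the function simulate_qe and the flip-probability scan are illustrative assumptions:

```python
# One plausible way to simulate word-level QE output at a chosen quality level:
# flip gold OK/BAD tags with growing probability until F1 on the BAD class
# drops to roughly the target value. The paper's actual procedure may differ.
import random
from sklearn.metrics import f1_score

def simulate_qe(gold, target_f1, seed=0, step=0.01):
    """Corrupt gold tags until F1-BAD against the gold tags is <= target_f1."""
    rng = random.Random(seed)
    flip_prob = 0.0
    while True:
        pred = [("BAD" if tag == "OK" else "OK") if rng.random() < flip_prob else tag
                for tag in gold]
        if f1_score(gold, pred, pos_label="BAD", zero_division=0) <= target_f1:
            return pred
        flip_prob += step

gold = ["OK", "BAD", "OK", "BAD", "OK", "OK", "BAD", "OK"] * 20   # toy gold word tags
simulated = simulate_qe(gold, target_f1=0.80)                     # roughly 80% F1-BAD labels
print(f1_score(gold, simulated, pos_label="BAD", zero_division=0))
```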

