Abstract

The translation quality estimation (QE) task, particularly the QE as a Metric task, aims to evaluate the overall quality of a translation based on the translation and the source sentence, without using reference translations. Supervised learning of this QE task requires human evaluations of translation quality as training data. Human evaluation of translation quality can be performed in different ways, including assigning an absolute score to a translation or ranking different translations. In order to make use of different types of human evaluation data for supervised learning, we present a multi-task learning QE model that jointly learns two tasks: scoring a translation and ranking two translations. Our QE model exploits cross-lingual sentence embeddings from pre-trained multilingual language models. We obtain new state-of-the-art results on the WMT 2019 QE as a Metric task and outperform sentBLEU on the WMT 2019 Metrics task.

Highlights

  • The translation quality estimation (QE) task (Fonseca et al., 2019) aims to evaluate the quality of a translation based on the translation and the source sentence without using reference translations

  • Since the QE as a Metric task requires QE models to assign an absolute score to a translation, Direct Assessment (DA) human evaluation data can be straightforwardly used as training data for the QE as a Metric task

  • Multi-task learning of these two closely related tasks enables us to use both DA and Relative Ranking (RR) human evaluation data for training the QE model and improves performance compared to learning the two tasks separately (the two kinds of training examples are sketched after this list)
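As a rough illustration (not from the paper; the record types and field names below are hypothetical), the two kinds of human evaluation data can be represented as two kinds of training examples:

```python
from dataclasses import dataclass

@dataclass
class DAExample:
    """Direct Assessment: one translation annotated with an absolute human score."""
    source: str        # source sentence
    translation: str   # system translation
    score: float       # human quality score (e.g., a standardized DA score)

@dataclass
class RRExample:
    """Relative Ranking: two translations of the same source, with a human preference."""
    source: str
    better_translation: str  # translation judged better by the annotator
    worse_translation: str   # translation judged worse
```

A DA example directly supervises an absolute scoring objective, while an RR example only tells the model which of two translations should receive the higher score.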


Summary

Introduction

The translation quality estimation (QE) task (Fonseca et al., 2019) aims to evaluate the quality of a translation based on the translation and the source sentence without using reference translations. In order to make use of Relative Ranking (RR) human evaluation data in addition to Direct Assessment (DA) scores, we propose a multi-task learning QE model that jointly learns two tasks: scoring a translation and ranking two translations. The models of Shimanaka et al. (2018) and Gupta et al. (2015) only learn to score a translation, and Guzman et al. (2015)'s model only learns to rank two translations, whereas our model jointly learns both tasks in order to exploit different types of human evaluation data for training. There are existing QE models (Lo, 2019; Yankovskaya et al., 2019) that do not need reference translations and estimate translation quality from cross-lingual word/sentence embeddings, but these QE models give relatively poor and unstable results across language pairs (Ma et al., 2019), while our QE model achieves more robust and better results. The QE models of Lo (2019) and Yankovskaya et al. (2019) only learn to score a translation, whereas ours jointly learns to score a translation and rank two translations via multi-task learning; a minimal sketch of this joint objective is given below.
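The following is a minimal PyTorch sketch of such a joint objective, assuming a generic cross-lingual sentence encoder has already produced embeddings for the source and the translation(s); the feature combination, head sizes, loss weighting, and margin are illustrative assumptions, not the authors' exact design:

```python
import torch
import torch.nn as nn

class MultiTaskQE(nn.Module):
    """Sketch of a multi-task QE model: one shared scorer serves both tasks,
    scoring a translation (DA regression) and ranking two translations (RR)."""

    def __init__(self, embed_dim: int):
        super().__init__()
        # Shared scoring head over combined cross-lingual sentence embeddings.
        self.scorer = nn.Sequential(
            nn.Linear(4 * embed_dim, 256),  # input: [src; mt; |src - mt|; src * mt]
            nn.Tanh(),
            nn.Linear(256, 1),
        )

    def score(self, src_emb: torch.Tensor, mt_emb: torch.Tensor) -> torch.Tensor:
        # Combine source and translation embeddings into match features, then score.
        feats = torch.cat(
            [src_emb, mt_emb, torch.abs(src_emb - mt_emb), src_emb * mt_emb], dim=-1
        )
        return self.scorer(feats).squeeze(-1)

def multitask_loss(model, da_batch, rr_batch, alpha=0.5):
    """Joint loss: MSE on DA scores plus a margin ranking loss on RR pairs.
    alpha is an illustrative task-weighting hyperparameter."""
    # DA batch: embeddings of (source, translation) plus gold human scores.
    src, mt, gold_scores = da_batch
    scoring_loss = nn.functional.mse_loss(model.score(src, mt), gold_scores)

    # RR batch: embeddings of one source and two translations, with a known preference.
    src2, better, worse = rr_batch
    s_better = model.score(src2, better)
    s_worse = model.score(src2, worse)
    target = torch.ones_like(s_better)  # the preferred translation should score higher
    ranking_loss = nn.functional.margin_ranking_loss(
        s_better, s_worse, target, margin=0.1
    )

    return alpha * scoring_loss + (1.0 - alpha) * ranking_loss
```

Because both losses flow through the same scorer, DA and RR annotations update the same parameters, which is what allows the two types of human evaluation data to reinforce each other.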

Our Approach
Settings
Segment-Level Results
System-Level Results
Findings
Conclusion
