Abstract

Measuring the domain relevance of data and selecting well-fitting domain data for machine translation (MT) is a well-studied topic, but denoising is not. Denoising concerns a different type of data quality: it aims to reduce the negative impact of data noise on MT training, in particular neural MT (NMT) training. This paper generalizes methods for measuring and selecting data for domain MT and applies them to denoising NMT training. The proposed approach uses trusted data and a denoising curriculum realized by online data selection. Intrinsic and extrinsic evaluations show that the approach is significantly effective when NMT is trained on data with severe noise.

Highlights

  • Data noise is an understudied topic in the machine translation (MT) field

  • Recent research has found that data noise has a bigger impact on neural machine translation (NMT) than on statistical machine translation (Khayrallah and Koehn, 2018), but learning what data quality means in NMT and how to make NMT training robust to data noise remains an open research question

  • We propose an approach to denoising online NMT training


Summary

Introduction

Data noise is an understudied topic in the machine translation (MT) field. Recent research has found that data noise has a bigger impact on neural machine translation (NMT) than on statistical machine translation (Khayrallah and Koehn, 2018), but learning what data quality (or noise) means in NMT and how to make NMT training robust to data noise remains an open research question. On the other hand, a rich body of MT data research focuses on domain data relevance and selection for domain adaptation. Axelrod et al. (2011) introduce a metric for measuring the relevance of data to a domain by using n-gram language models (LMs). van der Wees et al. (2017) employ a neural-network version of this metric and propose a gradually refining strategy to dynamically schedule data during NMT training. In these methods, a large amount of in-domain data is used to help measure domain relevance.
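To make the idea of LM-based domain relevance concrete, the sketch below scores a sentence by a cross-entropy difference between an in-domain LM and a general-domain LM, which is how this kind of metric is often formulated. It is a simplified illustration, not the cited authors' implementation: it uses add-one-smoothed unigram models instead of higher-order n-gram LMs, scores only one side of a sentence pair, and all function names and the toy corpora are hypothetical.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Count whitespace-separated tokens; returns (counts, total token count)."""
    counts = Counter(tok for s in sentences for tok in s.split())
    return counts, sum(counts.values())


def cross_entropy(sentence, model, vocab_size):
    """Per-token cross-entropy (in nats) under an add-one-smoothed unigram LM."""
    counts, total = model
    tokens = sentence.split()
    if not tokens:
        return 0.0
    log_prob = sum(
        math.log((counts[tok] + 1) / (total + vocab_size)) for tok in tokens
    )
    return -log_prob / len(tokens)


def relevance_score(sentence, in_model, gen_model, vocab_size):
    """Cross-entropy difference; lower values mean the sentence looks more in-domain."""
    return (cross_entropy(sentence, in_model, vocab_size)
            - cross_entropy(sentence, gen_model, vocab_size))


# Toy corpora (assumed for illustration only).
in_domain = ["the patient received a dose", "the dose was reduced for the patient"]
general = ["the team won the match", "the match ended in a draw"]

in_model = train_unigram(in_domain)
gen_model = train_unigram(general)

# Shared vocabulary size, plus one slot for unseen tokens.
vocab = set(in_model[0]) | set(gen_model[0])
vocab_size = len(vocab) + 1

score_medical = relevance_score("the patient dose", in_model, gen_model, vocab_size)
score_sports = relevance_score("the team match", in_model, gen_model, vocab_size)

# Rank candidate sentences from most to least in-domain-like.
ranked = sorted(
    ["the team match", "the patient dose"],
    key=lambda s: relevance_score(s, in_model, gen_model, vocab_size),
)
```

Sorting by this score and keeping the lowest-scoring sentences is one simple way to select data; a dynamic schedule would instead tighten the selection threshold as training progresses.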
