Abstract

There has been growing interest among researchers in quality estimation (QE), which attempts to automatically predict the quality of machine translation (MT) outputs. Most existing works on QE are based on supervised approaches using quality-annotated training data. However, QE training data quality scores readily become imbalanced or skewed : QE data are mostly composed of high translation quality sentence pairs but the data lack low translation quality sentence pairs. The use of imbalanced data with an induced quality estimator tends to produce biased translation quality scores with “high” translation quality scores assigned even to poorly translated sentences. To address the data imbalance, this article proposes a simple, efficient procedure called uniformly interpolated balancing to construct more balanced QE training data by inserting greater uniformness to training data. The proposed uniformly interpolated balancing procedure is based on the preparation of two different types of manually annotated QE data: (1) default skewed data and (2) near-uniform data . First, we obtain default skewed data in a naive manner without considering the imbalance by manually annotating qualities on MT outputs. Second, we obtain near-uniform data in a selective manner by manually annotating a subset only, which is selected from the automatically quality-estimated sentence pairs. Finally, we create uniformly interpolated balanced data by combining these two types of data, where one half originates from the default skewed data and the other half originates from the near-uniform data. We expect that uniformly interpolated balancing reflects the intrinsic skewness of the true quality distribution and manages the imbalance problem. Experimental results on an English-Korean quality estimation task show that the proposed uniformly interpolated balancing leads to robustness on both skewed and uniformly distributed quality test sets when compared to the test sets of other non-balanced datasets.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call