Using Statistical Machine Translation to Grade Training Data

Andrew Finch, Eiichiro Sumita

doi:10.1109/isuc.2008.20

Abstract

One of the main causes of errors in statistical machine translation is the presence of erroneous phrase pairs in the phrase table. These phrase pairs result from poor word-to-word alignments made while training the translation model. The word alignment errors cause errors in the phrase extraction phase, and the resulting erroneous bilingual phrase pairs are then used during decoding and appear in the output of the machine translation system. Machine translation training data is never perfect: bilingual sentence pairs are often incorrectly aligned at the sentence level, or are poor translations of each other due to human error. Even when the sentence pairs in the corpus are good translations of each other, the translations may not be literal enough to admit the sort of phrase-by-phrase translation needed to make good training data for a phrase-based statistical machine translation (SMT) system. This is because such SMT systems operate on the assumption that the source can be transformed into the target simply by translating phrase-by-phrase with re-ordering. In the real world, many perfectly correct translations are not of this form, and these sentences, even though they are correct translations, make poor training data for the translation models of a phrase-based SMT system. This paper presents a technique in which preliminary machine translation systems are built with the sole purpose of identifying those sentence pairs in the training corpus that the systems are able to generate using their models, the hypothesis being that these sentence pairs are likely to make good training data for an SMT system of the same type. These sentences are then used to bootstrap a second SMT system, and the sentences identified as good training data are given additional weight during the training of the translation models. Using this technique we were able to improve the performance of a Japanese-to-English SMT system by 1.2-1.5 BLEU points on unseen evaluation data.
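
The grading idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy phrase table, the function names `can_generate` and `weight_training_pairs`, the unrestricted re-ordering, and the use of duplication as the weighting mechanism are all assumptions made for illustration. The sketch checks whether a target sentence can be produced from its source by translating phrase-by-phrase with re-ordering, then repeats the sentence pairs that pass so that they count more in later training.

```python
from functools import lru_cache


def can_generate(src, tgt, phrase_table, max_len=3):
    """Return True if tgt can be built from src phrase-by-phrase with re-ordering.

    phrase_table maps a tuple of source tokens to a list of target-token tuples.
    Re-ordering is unrestricted here, which is a simplifying assumption.
    """
    src, tgt = tuple(src), tuple(tgt)
    full = (1 << len(tgt)) - 1  # bitmask with every target position covered

    @lru_cache(maxsize=None)
    def search(i, covered):
        # All source words consumed: succeed only if the whole target is covered.
        if i == len(src):
            return covered == full
        # Try every source phrase starting at position i.
        for j in range(i + 1, min(len(src), i + max_len) + 1):
            for cand in phrase_table.get(src[i:j], ()):
                n = len(cand)
                # Place the candidate on any free, matching span of the target.
                for k in range(len(tgt) - n + 1):
                    mask = ((1 << n) - 1) << k
                    if covered & mask == 0 and tgt[k:k + n] == cand:
                        if search(j, covered | mask):
                            return True
        return False

    return search(0, 0)


def weight_training_pairs(pairs, phrase_table, boost=2):
    """Repeat generatable pairs `boost` times (hypothetical weighting scheme)."""
    weighted = []
    for src, tgt in pairs:
        copies = boost if can_generate(src, tgt, phrase_table) else 1
        weighted.extend([(src, tgt)] * copies)
    return weighted


# Toy phrase table, invented for this example.
toy_table = {
    ("watashi", "wa"): [("i",)],
    ("neko", "ga"): [("cats",)],
    ("suki", "desu"): [("like",)],
}
source = "watashi wa neko ga suki desu".split()
pairs = [
    (source, "i like cats".split()),      # literal: reachable with re-ordering
    (source, "cats are great".split()),   # free translation: not reachable
]
weighted = weight_training_pairs(pairs, toy_table)
```

In this toy run the literal pair is kept twice and the free translation once. A real system would take its phrase table from the preliminary SMT system and would need a decoder-level test of whether a pair can be generated, rather than this exhaustive search.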

Full Text