Comparable corpora can benefit the development of Neural Machine Translation models, in particular for under-resourced languages. We present a case study centred on the exploitation of a large comparable corpus for Basque-Spanish, created from independently-produced news by the Basque public broadcaster eitb. We evaluate the impact of different techniques for exploiting the original data to complement parallel datasets for this language pair in both translation directions. We explore two efficient methods for parallel sentence mining. The two methods shared a common core of approximately half of the total number of aligned sentences, and each also identified valid parallel sentences not captured by the other. Filtering the data by identifying length-difference outliers proved highly effective in improving the models, as did the use of tags to discriminate between comparable and parallel data in the training corpora. We also evaluate the use of backtranslated data, with results indicating that alignment-based datasets remain the most beneficial, although complementary backtranslations should also be included to fully exploit the available comparable data. Overall, the results demonstrate that this type of data needs to be carefully analysed prior to its use as training data for Neural Machine Translation, since issues such as information imbalance between source and target data can lead to suboptimal results for a given translation pair.