Abstract

Multilingual neural machine translation (MNMT) models are theoretically attractive for low- and zero-resource language pairs because of cross-lingual knowledge transfer. Existing approaches mainly focus on English-centric directions and usually underperform their pivot-based counterparts on non-English directions. In this work, we aim to build a many-to-many MNMT system with an emphasis on the quality of non-English directions by exploring selective and aligned online data augmentation algorithms. Motivated by our finding that augmented synthetic samples are not “the more, the better,” we propose selective online back-translation (SOBT) and thoroughly study different selection criteria for picking suitable samples for training. Furthermore, we boost SOBT with cross-lingual online substitution (CLOS) to align token representations and encourage transfer learning. Our intuition is based on the hypothesis that a universal cross-lingual representation leads to better multilingual translation performance, especially for non-English directions. Compared to previous state-of-the-art many-to-many MNMT models and conventional pivot-based methods, experiments on the IWSLT2014 and OPUS-100 translation benchmarks show that our approach achieves competitive or even better performance on English-centric directions and yields gains of up to ∼12 BLEU for non-English directions. All of our models and code are publicly available.
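To make the two components concrete, the sketch below illustrates one possible shape of a training step that combines selective online back-translation with cross-lingual token substitution. It is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the single-score thresholding criterion (the abstract only says several selection criteria are studied), and the dictionary-based substitution are all placeholders introduced here for clarity.

```python
# Hypothetical sketch of one data-augmentation step combining selective online
# back-translation (SOBT) with cross-lingual online substitution (CLOS).
# The model interface, the confidence-threshold selection criterion, and the
# bilingual dictionary are illustrative assumptions, not the paper's method.

import random
from typing import Callable, Dict, List, Tuple


def sobt_select(
    monolingual_targets: List[Tuple[str, str]],               # (target_lang, sentence)
    back_translate: Callable[[str, str], Tuple[str, float]],  # returns (synthetic source, score)
    score_threshold: float,
) -> List[Tuple[str, str]]:
    """Generate synthetic (source, target) pairs online and keep only those whose
    selection score passes the threshold, since more synthetic data is not always better."""
    selected = []
    for lang, tgt in monolingual_targets:
        src, score = back_translate(lang, tgt)   # online back-translation
        if score >= score_threshold:             # selective filtering
            selected.append((src, tgt))
    return selected


def clos_substitute(
    tokens: List[str],
    bilingual_dict: Dict[str, str],
    sub_prob: float = 0.15,
) -> List[str]:
    """Randomly replace tokens with counterparts from another language, nudging
    the encoder toward a shared cross-lingual representation."""
    return [
        bilingual_dict.get(tok, tok) if random.random() < sub_prob else tok
        for tok in tokens
    ]


if __name__ == "__main__":
    # Stand-in for a real reverse translation model: returns a dummy source and score.
    def fake_back_translate(lang: str, sentence: str) -> Tuple[str, float]:
        return f"<{lang}-synthetic> {sentence}", random.random()

    pairs = sobt_select(
        [("de", "guten morgen"), ("fr", "bonjour tout le monde")],
        fake_back_translate,
        score_threshold=0.5,
    )
    toy_dict = {"guten": "good", "morgen": "morning"}
    for src, tgt in pairs:
        print(clos_substitute(src.split(), toy_dict), "->", tgt)
```

In practice the selection score and substitution table would come from the translation model itself (e.g., model confidence or learned alignments), but the control flow above captures the selective and aligned augmentation idea described in the abstract.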
