Multilingual neural machine translation (MNMT) has attracted increasing attention in recent years because a single neural machine translation (NMT) model can translate between multiple languages. Since several languages are involved in MNMT, recent studies have shown that training the model on a subset of these languages, rather than all of them, yields comparable results. However, previous work on this topic mainly focuses on language clustering and features defined by linguists; semantic relationships and language distance are not fully considered. How to select the language pairs most related to a given low-resource pair so as to optimize MNMT performance remains an open question. In this paper, we formulate language relatedness computation as a ranking problem, in which features such as language distance, linguistic typological information, and semantic relatedness are fed into a random decision forest to improve language relatedness evaluation (LRE) for MNMT. Because such a model captures only monolingual LRE, as in general cross-lingual natural language processing tasks, we also propose two machine-translation-specific features (data size and bilingual relatedness) to predict the final language pairs. Experimental results on the IWSLT and WMT datasets show that our proposed LRE method achieves significant improvements over other models. We also conducted several groups of experiments on the IWSLT and WMT datasets to further evaluate the effectiveness of the proposed method for MNMT. The results show that an MNMT model trained on the language pairs predicted by the LRE method outperforms other language selection methods.