Noise-Robust Semi-supervised Multi-modal Machine Translation

Lin Li,Kaixi Hu,Kong Aik Lee,Jianquan Liu,Turghun Tayir

doi:10.1007/978-3-031-20865-2_12

Abstract

AbstractRecent unsupervised multi-modal machine translation methods have shown promising performance for capturing semantic relationships in unannotated monolingual corpora by large-scale pretraining. Empirical studies show that small accessible parallel corpora can achieve comparable performance gains of large pretraining corpora in unsupervised setting. Inspired by the observation, we think semi-supervised learning can largely reduce the demand of pretraining corpora without performance degradation in low-cost scenario. However, images of parallel corpora typically contain much irrelevant information, i.e., visual noises. Such noises have a negative impact on the semantic alignment between source and target languages in semi-supervised learning, thus weakening the contribution of parallel corpora. To effectively utilize the valuable and expensive parallel corpora, we propose a Noise-robust Semi-supervised Multi-modal Machine Translation method (Semi-MMT). In particular, a visual cross-attention sublayer is introduced into source and target language decoders, respectively. And, the representations of texts are used as a guideline to filter visual noises. Based on the visual cross-attention, we further devise a hybrid training strategy by employing four unsupervised and two supervised tasks to reduce the mismatch between the semantic representation spaces of source and target languages. Extensive experiments conducted on the Multi30k dataset show that our method outperforms the state-of-the-art unsupervised methods with large-scale extra corpora for pretraining in terms of METEOR metric, yet only requires 7% parallel corpora.KeywordsMultimodal dataNeural machine translationSemi-supervised learningNoise

Full Text