Fully Unsupervised Machine Translation Using Context-Aware Word Translation and Denoising Autoencoder

Shweta Chauhan, Philemon Daniel, Shefali Saxena, Ayush Sharma

doi:10.1080/08839514.2022.2031817

Abstract

Learning machine translation from monolingual data sets alone is a complex task, because there are many possible ways to associate target sentences with source sentences. Monolingual word embeddings can be linearly mapped into a common shared space in an unsupervised way, through robust learning or adversarial training, but these techniques have fundamental limitations when translating whole sentences. This paper proposes a simple yet effective method for fully unsupervised machine translation that is based on cross-lingual sense-to-word embeddings, rather than cross-lingual word embeddings, and a language model. Word sense disambiguation is used to incorporate the source-language context so that the sense of each word is selected more appropriately. A language model that accounts for target-language context in lexical choices is integrated with a denoising autoencoder that handles insertion, deletion, and reordering. The proposed approach eliminates the problem of noisy target-language context caused by erroneous word translations. The work also addresses the challenge of homonyms and polysemous words in morphologically rich languages. Experiments on English-Hindi and Hindi-English, evaluated with several metrics, show an improvement of +3 points in BLEU and METEOR-Hindi over the baseline system.
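The denoising step mentioned in the abstract can be illustrated with a small sketch. A denoising autoencoder is typically trained to reconstruct a clean sentence from a corrupted copy; corrupting by dropping tokens, inserting filler tokens, and shuffling locally gives the model practice at insertion, deletion, and reordering respectively. The Python below is only an assumed noise function in that general style, not the authors' implementation: the function name `add_noise`, the `<blank>` filler token, and all parameter names and default values are hypothetical.

```python
import random


def add_noise(tokens, drop_prob=0.1, insert_prob=0.1, max_shift=3,
              filler="<blank>", seed=None):
    """Return a corrupted copy of `tokens` (all defaults are illustrative)."""
    rng = random.Random(seed)

    # Deletion noise: drop each token independently with probability drop_prob.
    # Reconstructing the original forces the model to insert missing words.
    kept = [t for t in tokens if rng.random() >= drop_prob]

    # Insertion noise: put a filler token before some of the kept tokens.
    # Reconstructing the original forces the model to delete spurious words.
    filled = []
    for t in kept:
        if rng.random() < insert_prob:
            filled.append(filler)
        filled.append(t)

    # Reordering noise: sort positions by index plus a random offset drawn
    # from [0, max_shift + 1), so no token moves more than max_shift places.
    keys = [i + rng.uniform(0, max_shift + 1) for i in range(len(filled))]
    order = sorted(range(len(filled)), key=keys.__getitem__)
    return [filled[i] for i in order]


if __name__ == "__main__":
    sentence = "the bank approved the loan yesterday".split()
    print(add_noise(sentence, seed=0))
```

In training, each pair of corrupted and original sentences would serve as input and reconstruction target; the sketch covers only the corruption step and leaves the model itself out.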

Full Text