Abstract

Great progress has been made in unsupervised bilingual lexicon induction (UBLI) by aligning source and target word embeddings that are independently trained on monolingual corpora. Most UBLI models share the assumption that the embedding spaces of the two languages are approximately isomorphic (i.e., similar in geometric structure). Performance is therefore bounded by the degree of isomorphism, especially for etymologically and typologically distant languages, for which near-zero UBLI results have been reported. To address this problem, we propose a method that increases isomorphism through bilingual word embedding fusion: features from the source embeddings are integrated into the target embeddings, and vice versa, so that the resulting source and target embedding structures become similar to each other. The method requires no supervision of any form and can be applied to any language pair. On a benchmark bilingual lexicon induction dataset, our approach achieves competitive or superior performance compared to state-of-the-art methods, with particularly strong results on distant languages.
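To make the idea of "embedding fusion" concrete, here is a minimal, hypothetical sketch of one way features from one space could be mixed into the other. It assumes source embeddings `X`, target embeddings `Y`, a candidate cross-lingual mapping `W`, and a mixing weight `lam`; all of these names, and the nearest-neighbor interpolation itself, are illustrative assumptions, not the paper's actual procedure.

```python
import numpy as np

def fuse(X, Y, W, lam=0.5):
    """Illustrative fusion: blend each row of X with the features of its
    nearest neighbor in Y under a candidate mapping W (assumption, not
    the paper's exact method)."""
    XW = X @ W                                    # map source into target space
    XW = XW / np.linalg.norm(XW, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    nn = (XW @ Yn.T).argmax(axis=1)               # cosine nearest neighbors
    return (1.0 - lam) * X + lam * Y[nn]          # integrate neighbor features

# Toy data: random vectors stand in for monolingual embeddings.
rng = np.random.default_rng(0)
d = 8
X = rng.normal(size=(5, d))                       # 5 "source" words
Y = rng.normal(size=(6, d))                       # 6 "target" words
W = np.eye(d)                                     # identity as a stand-in mapping

X_fused = fuse(X, Y, W)                           # source enriched with target features
Y_fused = fuse(Y, X, W.T)                         # and vice versa (symmetric direction)
print(X_fused.shape, Y_fused.shape)               # (5, 8) (6, 8)
```

Applying the fusion in both directions, as in the last two calls, mirrors the abstract's "and vice versa": each space absorbs features of the other, pulling the two geometries closer together.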
