Abstract

A mainstream approach to cross-lingual word embeddings is to learn a linear mapping between two monolingual embedding spaces from a training dictionary. A successful linear mapping requires the two embedding spaces to be isomorphic. However, monolingual embedding spaces are not perfectly isomorphic, so a linear mapping cannot align them accurately. In this study, we assume that two embedding spaces are composed of near-isomorphic translation pairs (NearITP) and non-isomorphic translation pairs. Because near-isomorphic pairs occupy similar substructures in the two spaces, NearITP allow a linear mapping to work well. Motivated by this, we design a screening strategy that identifies NearITP effectively. Using this strategy, we find that the proportion of NearITP in the commonly used training dictionary is relatively low, leading to sub-optimal results. To address this problem, we propose a general framework that can be combined with any mapping method and further boosts the subsequent mapping. Experimental results demonstrate that our framework improves over existing mapping-based methods and outperforms state-of-the-art models on two public datasets. Moreover, we show that our framework generalizes successfully to contextual word embeddings such as multilingual BERT (mBERT), further enhancing the cross-lingual properties of mBERT.
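
For background, here is a minimal sketch of the kind of supervised linear mapping the abstract refers to. It uses the standard orthogonal Procrustes solution over a seed training dictionary; this is one common instantiation of such a mapping, not the framework proposed in the paper, and the matrices X, Y and all sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 300  # toy seed-dictionary size and embedding dimension

# Rows i of X and Y are the embeddings of the i-th translation pair
# in the source and target language, respectively (random stand-ins here).
X = rng.standard_normal((n, d))
Y = rng.standard_normal((n, d))

# Orthogonal Procrustes: find the orthogonal W minimizing ||X W - Y||_F.
# Closed form: if U S V^T is the SVD of X^T Y, then W = U V^T.
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

# A source word is translated by mapping its embedding into the target
# space and retrieving the nearest target embedding (cosine similarity).
mapped = X @ W
src = mapped / np.linalg.norm(mapped, axis=1, keepdims=True)
tgt = Y / np.linalg.norm(Y, axis=1, keepdims=True)
predictions = (src @ tgt.T).argmax(axis=1)  # predicted translation indices
```

Constraining W to be orthogonal preserves distances within each monolingual space, which is exactly why such a mapping can only succeed to the extent that the two spaces are isomorphic, the property the paper's screening strategy targets.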
