Bilingual Lexicon Induction (BLI) aims to induce word translation pairs across languages. We find that a large proportion of the words in the datasets used for BLI are polysemous, which degrades the quality of the induced translation pairs. In particular, mainstream BLI models map two monolingual word embeddings (WEs) into a shared space and treat the closest cross-lingual words as translation pairs. However, when a word has multiple translations, it is difficult to ensure that the polysemous word is the closest neighbor to all of them. Based on this analysis of polysemy, we propose a simple yet effective method that harnesses polysemous words to enhance BLI. We first detect polysemous words in the seed lexicon and filter their multiple translations by comparing their semantic similarities; the refined seed lexicon thus spares the BLI model the confusion caused by translations with divergent semantics. Moreover, when only single translations are available in the seed lexicon, we employ a Large Language Model (LLM) to acquire monolingual synonyms, which refines the structure of the seed lexicon further. On benchmark BLI datasets, our method outperforms five state-of-the-art (SOTA) baseline systems on almost all experimental language pairs, covering both directions of three similar and three distant language pairs.
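To make the retrieval step concrete, the following is a minimal sketch of nearest-neighbor translation retrieval in a shared embedding space, the standard final step of mapping-based BLI described above. The toy 3-d vectors, the word lists, and the function name are illustrative assumptions, not the paper's actual data or implementation:

```python
import numpy as np

# Hypothetical target-language words with toy 3-d embeddings already
# mapped into a shared cross-lingual space (illustration only).
tgt_words = ["banco", "orilla", "silla"]
tgt_vecs = np.array([
    [0.9, 0.1, 0.0],   # "banco"  (financial sense of "bank")
    [0.1, 0.9, 0.0],   # "orilla" (river-bank sense of "bank")
    [0.0, 0.1, 0.9],   # "silla"  (unrelated word, "chair")
])

def nearest_neighbors(src_vec, tgt_vecs, tgt_words, k=2):
    """Return the k target words closest to a mapped source embedding
    by cosine similarity -- the retrieval rule used by mapping-based
    BLI models to induce translation pairs."""
    src = src_vec / np.linalg.norm(src_vec)
    tgt = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    sims = tgt @ src                      # cosine similarities
    top = np.argsort(-sims)[:k]           # indices of k best matches
    return [tgt_words[i] for i in top]

# A polysemous source word ("bank") sits between both of its senses,
# so neither translation can be its single closest neighbor -- the
# difficulty the abstract points out.
src_bank = np.array([0.7, 0.7, 0.1])
top2 = nearest_neighbors(src_bank, tgt_vecs, tgt_words, k=2)
```

Here both valid translations are retrieved only when k > 1; with the usual k = 1 retrieval, one of the two correct translations is necessarily missed, which is the motivation for treating polysemous seed entries specially.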