Abstract
Bilingual Lexicon Induction (BLI) aims to induce word translation pairs across different languages. We find that a large proportion of the words in the datasets used for BLI are polysemous, which degrades the quality of the induced translation pairs. In particular, mainstream BLI models map two monolingual word embeddings (WEs) into a shared space and treat the closest cross-lingual words as translation pairs. However, when a word has multiple translations, it is difficult to ensure that the polysemous word is the nearest neighbor of all of its translations. Based on this analysis of polysemy, we propose a simple yet effective method that harnesses polysemous words to enhance BLI. We first detect polysemous words in the seed lexicons and filter their multiple translations by comparing their semantic similarities. The refined seed lexicon thus mitigates the confusion that translations with divergent semantics cause for the BLI model. Moreover, in scenarios where only single translations are available in the seed lexicon, we propose employing a Large Language Model (LLM) to acquire monolingual synonyms, thereby refining the structure of the seed lexicon more effectively. On benchmark BLI datasets, our method outperforms five state-of-the-art (SOTA) baseline systems on almost all experimental language pairs, encompassing both directions of three similar and three distant language pairs.
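The filtering step described above can be illustrated with a minimal sketch. The abstract does not specify the similarity measure or threshold, so the code below assumes cosine similarity over embeddings already mapped into the shared space, with a hypothetical cutoff of 0.5; the function name `refine_entry` and the toy 3-d vectors are likewise illustrative, not the paper's actual implementation.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def refine_entry(src_vec, cand_vecs, threshold=0.5):
    """Keep only the candidate translations whose mapped embedding is
    sufficiently similar to the source word's embedding.
    `threshold` is an assumed hyperparameter, not from the paper."""
    return [i for i, v in enumerate(cand_vecs)
            if cosine(src_vec, v) >= threshold]

# Toy example: hypothetical 3-d embeddings in a shared cross-lingual space.
src = np.array([1.0, 0.0, 0.0])
cands = [
    np.array([0.9, 0.1, 0.0]),  # translation of the dominant sense (kept)
    np.array([0.0, 1.0, 0.0]),  # translation of a divergent sense (dropped)
]
print(refine_entry(src, cands))  # -> [0]
```

In practice the candidate translations would come from the seed lexicon entries of a detected polysemous word, and the surviving subset would replace the original entry before training the BLI mapping.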