This paper presents a contrastive learning approach to morphological disambiguation (MD) using large language models (LLMs). The approach is trained with a contrastive loss that pulls the embedding of the correct morphological analysis toward the contextual embedding of the token while enforcing a margin between the correct and incorrect analysis embeddings. One aim of the paper is to analyze the effect of fine-tuning an LLM for MD in morphologically complex languages (MCLs), with a focus on Kazakh, a typical low-resource language, and Turkish. A second aim is to compare distance measures for the contrastive loss, since disambiguation is performed by computing the distance between the context embedding and each candidate analysis embedding. Existing approaches to MD, such as HMM-based and feature-engineered models, struggle to capture long-range dependencies and to cope with large, sparse tagsets. The proposed approach mitigates these limitations by leveraging LLMs, improving accuracy on ambiguous and out-of-vocabulary (OOV) tokens without relying on additional hand-crafted features. Experiments on three datasets covering the two MCLs show that the proposed contrastive loss improves MD performance when combined with knowledge from large language models.
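Since the abstract only sketches the loss, the following Python snippet illustrates one common margin-based formulation consistent with the description. The function name, tensor shapes, and the choice of Euclidean distance are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_md_loss(context_emb, correct_emb, incorrect_embs, margin=1.0):
    """Margin-based contrastive loss sketch for morphological disambiguation.

    context_emb:    (batch, dim)    contextual embedding of the ambiguous token
    correct_emb:    (batch, dim)    embedding of the correct analysis
    incorrect_embs: (batch, k, dim) embeddings of k incorrect candidate analyses
    """
    # Pull the correct analysis toward the context. Euclidean distance is
    # assumed here; the paper compares several distance measures.
    pos_dist = F.pairwise_distance(context_emb, correct_emb)  # (batch,)

    # Push each incorrect analysis at least `margin` away from the context.
    neg_dist = torch.norm(
        context_emb.unsqueeze(1) - incorrect_embs, dim=-1
    )  # (batch, k)
    neg_term = F.relu(margin - neg_dist).mean(dim=1)  # (batch,)

    return (pos_dist + neg_term).mean()
```

At inference time, disambiguation would then select the candidate analysis whose embedding is closest to the context embedding under the same distance measure.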