Co-occurrence Weight Selection in Generation of Word Embeddings for Low Resource Languages

Veysel Yücesoy, Aykut Koç

doi:10.1145/3282443

Abstract

This study aims to improve the performance of word embeddings by proposing a new weighting scheme for co-occurrence counting. The idea behind this new family of weights is to compensate for the disadvantage that word pairs which are semantically close but appear far apart suffer in co-occurrence counting. For high-resource languages, this disadvantage may have little effect because such pairs still co-occur frequently. However, when the available resources are insufficient, such pairs are penalized for being distant. To favour such pairs, a weighting scheme based on a polynomial fitting procedure is proposed that shifts the weights up for distant words while leaving the weights of nearby words almost unchanged. The parameter optimization for the new weights and the effects of the weighting scheme are analysed for English, Italian, and Turkish. A small portion of the English resources and a quarter of the Italian resources are used for demonstration purposes, as if these were low-resource languages. A performance increase is observed in analogy tests when the proposed weighting scheme is applied to relatively small corpora (i.e., mimicking low-resource languages) of both English and Italian. To show that the scheme is effective specifically for small corpora, it is also shown that, for a large English corpus, the proposed weighting scheme does not outperform the original weights. Since Turkish is a relatively low-resource language, it is demonstrated that the proposed weighting scheme can increase performance on both analogy and similarity tests when all Turkish Wikipedia pages are used as a corpus. The positive effect of the proposed scheme has also been demonstrated in a standard sentiment analysis task for Turkish.
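
The abstract does not give the weight formula, so the following is only a minimal sketch of distance-weighted co-occurrence counting with a pluggable weight function. It assumes that the "original weights" are the harmonic 1/d weights commonly used in co-occurrence counting, which the abstract does not state. The function make_lifted_weight and its lift parameter are hypothetical stand-ins that raise the weights of distant pairs while leaving d = 1 unchanged; they are not the polynomial-fitted family proposed in the paper.

```python
from collections import defaultdict


def harmonic_weight(distance):
    """Assumed baseline: a pair at distance d contributes 1/d."""
    return 1.0 / distance


def make_lifted_weight(lift):
    """Hypothetical stand-in for the paper's weights (not the authors' formula).

    Interpolates between 1/d (lift=0) and a flat weight of 1 (lift=1), so
    distant pairs are shifted up while the weight at d=1 stays exactly 1.
    """
    def weight(distance):
        base = 1.0 / distance
        return base + lift * (1.0 - base)
    return weight


def cooccurrence_counts(tokens, window, weight):
    """Accumulate symmetric, distance-weighted co-occurrence counts."""
    counts = defaultdict(float)
    for i, center in enumerate(tokens):
        for distance in range(1, window + 1):
            j = i + distance
            if j >= len(tokens):
                break  # ran past the end of the token sequence
            w = weight(distance)
            counts[(center, tokens[j])] += w
            counts[(tokens[j], center)] += w
    return counts


tokens = ["a", "b", "c"]
baseline = cooccurrence_counts(tokens, window=2, weight=harmonic_weight)
lifted = cooccurrence_counts(tokens, window=2, weight=make_lifted_weight(0.5))
print(baseline[("a", "c")], lifted[("a", "c")])
```

In this toy sequence the adjacent pairs receive the same count under both weightings, while the pair at distance 2 receives a larger count under the lifted weights. That is the qualitative behaviour the abstract describes, whatever the exact functional form used in the paper.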


Journal: ACM Transactions on Asian and Low-Resource Language Information Processing
Publication Date: Jan 9, 2019
Citations: 5

Similar Papers

Cross-lingual offensive speech identification with transfer learning for low-resource languages
Xiayang Shi ... Shaolin Zhu
Computers & Electrical Engineering | VOL. 101
23 May 2022

Phonemic variations in similar words of Turkish and Urdu language
Tania Ali Khan
Dil ve Dilbilimi Çalışmaları Dergisi | VOL. -
25 Mar 2021

Do not neglect related languages: The case of low-resource Occitan cross-lingual word embeddings
Lisa Woller ... Viktor Hangya
01 Jan 2020

Part-of-Speech Tagging via Deep Neural Networks for Northern-Ethiopic Languages
Jurgita Kapočiūtė-Dzikienė ... Senait Gebremichael Tesfagergish
Information Technology and Control | VOL. 49
19 Dec 2020
