Abstract

Natural languages contain numerous ambiguous words, and without automated word sense disambiguation, the development of natural language processing technologies such as information extraction, information retrieval, and machine translation remains a challenging task. Therefore, this paper presents the development of a word sense disambiguation model for duplicate-alphabet words in the Ge'ez language using corpus-based methods. Because there is no WordNet or public dataset for Ge'ez, 1010 samples of ambiguous words were gathered. The words were then preprocessed, and the text was vectorized using bag of words, term frequency-inverse document frequency (TF-IDF), and word embeddings such as word2vec and fastText. The vectorized texts were then classified using supervised machine learning algorithms: Naive Bayes, decision trees, random forests, K-nearest neighbors, linear support vector machine, and logistic regression. Bag of words paired with random forests outperformed all other combinations, with an accuracy of 99.52%. However, when deep learning algorithms such as a deep neural network and long short-term memory were applied to the same dataset, 100% accuracy was achieved.
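The classical pipeline described above (vectorize sentences containing an ambiguous word, then train a classifier to predict the intended sense) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' code: the toy English sentences and sense labels are placeholders standing in for the Ge'ez dataset, and scikit-learn is assumed as the implementation library.

```python
# Minimal sketch of corpus-based word sense disambiguation:
# bag-of-words features fed to a random forest, the combination the
# abstract reports as best among the classical methods.
# The data below is a toy placeholder, NOT the paper's Ge'ez corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

# Each training sentence contains the ambiguous word "bank",
# labeled with the sense it takes in that context.
sentences = [
    "the bank approved the loan",
    "the bank charged interest on the loan",
    "we sat on the river bank",
    "the river bank was muddy",
]
senses = ["finance", "finance", "river", "river"]

# Pipeline: bag-of-words vectorization -> random forest classifier.
model = make_pipeline(CountVectorizer(), RandomForestClassifier(random_state=0))
model.fit(sentences, senses)

# Predict the sense of "bank" in an unseen context.
print(model.predict(["we walked along the muddy river bank"]))
```

Swapping `CountVectorizer` for `TfidfVectorizer` gives the TF-IDF variant of the same pipeline; the embedding-based variants (word2vec, fastText) would instead average per-word vectors into a sentence representation before classification.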
