Abstract

Natural languages contain numerous ambiguous words, and without automated word sense disambiguation, the development of natural language processing technologies such as information extraction, information retrieval, and machine translation remains a challenging task. Therefore, this paper presents the development of a word sense disambiguation model for duplicate-alphabet words in the Ge'ez language using corpus-based methods. Because there is no WordNet or public dataset for Ge'ez, 1010 samples of ambiguous words were gathered. The words were then preprocessed, and the text was vectorized using bag of words, term frequency-inverse document frequency (TF-IDF), and word embeddings such as word2vec and fastText. The vectorized texts were then classified using supervised machine learning algorithms: Naive Bayes, decision trees, random forests, K-nearest neighbors, linear support vector machine, and logistic regression. Bag of words paired with random forests outperformed all other combinations, with an accuracy of 99.52%. However, when deep learning algorithms such as a deep neural network and long short-term memory were applied to the same dataset, 100% accuracy was achieved.
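The classical pipeline described above (vectorize sentences containing an ambiguous word, then train a classifier to predict the intended sense) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' code: the toy English sentences and sense labels are placeholders standing in for the Ge'ez dataset, and scikit-learn is assumed as the implementation library.

```python
# Minimal sketch of corpus-based word sense disambiguation:
# bag-of-words features fed to a random forest, the combination the
# abstract reports as best among the classical methods.
# The data below is a toy placeholder, NOT the paper's Ge'ez corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

# Each training sentence contains the ambiguous word "bank",
# labeled with the sense it takes in that context.
sentences = [
    "the bank approved the loan",
    "the bank charged interest on the loan",
    "we sat on the river bank",
    "the river bank was muddy",
]
senses = ["finance", "finance", "river", "river"]

# Pipeline: bag-of-words vectorization -> random forest classifier.
model = make_pipeline(CountVectorizer(), RandomForestClassifier(random_state=0))
model.fit(sentences, senses)

# Predict the sense of "bank" in an unseen context.
print(model.predict(["we walked along the muddy river bank"]))
```

Swapping `CountVectorizer` for `TfidfVectorizer` gives the TF-IDF variant of the same pipeline; the embedding-based variants (word2vec, fastText) would instead average per-word vectors into a sentence representation before classification.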
