Abstract

Morphologically rich languages, such as isiZulu, have a large number of surface words due to their highly productive (agglutinative) nature. As a result, Natural Language Processing (NLP) models learnt from training corpora fail to generalise to the many words that were not seen in the training data. Some researchers believe that using morphemes for NLP in morphologically rich languages yields better models. This belief rests on two premises: (i) morphemes are the most basic meaning-bearing units of a language; (ii) the space of morphemes is much smaller than the space of word forms, so morpheme-based models are more likely to generalise to unseen words than word-based models. In this paper we investigate the veracity of these premises by comparing morpheme-level embeddings to word-level embeddings on (i) a semantic relatedness task and (ii) a Word Sense Disambiguation (WSD) task. The results show that morpheme-level embeddings were outperformed by word-level embeddings on the semantic relatedness task, but fared much better than their word-level counterparts on the WSD task.
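To make the comparison concrete, the sketch below illustrates one common way a semantic relatedness evaluation of this kind can be set up: compute cosine similarity between embedding vectors for each word pair and correlate the scores with human relatedness ratings, composing morpheme vectors (here by averaging) when evaluating the morpheme-level model. The function and variable names (embed_word, evaluate, segmenter, the example segmentation) are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a semantic relatedness evaluation comparing
# word-level and morpheme-level embeddings. All names are placeholders.
import numpy as np
from scipy.stats import spearmanr


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def embed_word(word, vectors, segmenter=None):
    """Word-level: look the word up directly.
    Morpheme-level: segment the word and average its morpheme vectors."""
    if segmenter is None:
        return vectors.get(word)
    morphemes = segmenter(word)  # e.g. a segmenter might map "abafundi" -> ["aba", "fund", "i"]
    parts = [vectors[m] for m in morphemes if m in vectors]
    return np.mean(parts, axis=0) if parts else None


def evaluate(pairs, vectors, segmenter=None):
    """Spearman correlation between model similarities and human ratings.

    `pairs` is a list of (word1, word2, human_rating) tuples;
    `vectors` maps tokens (words or morphemes) to numpy vectors.
    """
    model_scores, human_scores = [], []
    for w1, w2, rating in pairs:
        e1 = embed_word(w1, vectors, segmenter)
        e2 = embed_word(w2, vectors, segmenter)
        if e1 is not None and e2 is not None:
            model_scores.append(cosine(e1, e2))
            human_scores.append(rating)
    return spearmanr(model_scores, human_scores).correlation
```

Under this setup, the word-level model is penalised whenever a test word is out of vocabulary, while the morpheme-level model can still build a vector from known morphemes, which is the generalisation argument the second premise rests on.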
