Enhancing African low-resource languages: Swahili data for language modelling.

Casper S Shikali, Refuoe Mokhosi

doi:10.1016/j.dib.2020.105951

Abstract

Language modelling using neural networks requires adequate data to guarantee quality word representations, which are important for natural language processing (NLP) tasks. However, African languages, Swahili in particular, have been disadvantaged, and most of them are classified as low-resource languages because of inadequate data for NLP. In this article, we derive and contribute an unannotated Swahili dataset, a Swahili syllabic alphabet and a Swahili word analogy dataset to address the need for language processing resources, especially for low-resource languages. We derive the unannotated Swahili dataset by pre-processing raw Swahili data using a Python script, formulate the syllabic alphabet, and develop the Swahili word analogy dataset based on an existing English dataset. We envisage that the datasets will support not only language models but also downstream NLP tasks such as part-of-speech tagging, machine translation and sentiment analysis.

Highlights

Language modelling using neural networks requires adequate data to guarantee quality word representations, which are important for natural language processing (NLP) tasks
We derive and contribute an unannotated Swahili dataset, a Swahili syllabic alphabet and a Swahili word analogy dataset to address the need for language processing resources, especially for low-resource languages
We derive the unannotated Swahili dataset by pre-processing raw Swahili data using a Python script, formulate the syllabic alphabet, and develop the Swahili word analogy dataset based on an existing English dataset

Summary

Language modelling

Natural language processing (NLP) involves using computational techniques to learn, understand and produce human language content [7]. Machine learning and deep learning algorithms have been instrumental in NLP [15], with word embeddings playing an important role in the success of language automation systems. These algorithms depend heavily on the availability of data.

Despite the popularity of the language and the abundance of speech and text data, Swahili is still classified as a low-resource language, with limited pre-processed open-access data [2,3,4]. For this reason, NLP research on Swahili has been limited to the restricted annotated Helsinki dataset [8], which was developed by researchers from the University of Helsinki in conjunction with the University of Nairobi.

Given the representation vectors of words A, B and C, the vector of D can be derived as $X_B - X_A + X_C$, where $X_i$ represents the word representation vector of word $i$. This test cannot be applied to Swahili language models because no Swahili analogy dataset exists.

Data description

This section provides an individual description of each dataset in the following subsections.

Unannotated Swahili dataset

Swahili syllabic alphabet

Swahili word analogy dataset
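
The analogy test mentioned in the summary, in which the vector of D is derived as X_B - X_A + X_C, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names `cosine` and `solve_analogy` are hypothetical, the two-dimensional vectors are invented toy values rather than trained embeddings, and the word quadruple is not taken from the dataset. Choosing the answer by cosine similarity and excluding the three query words are common conventions in analogy evaluation, not details confirmed by this article.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def solve_analogy(vectors, a, b, c):
    """Return the word whose vector is closest to X_b - X_a + X_c.

    The query words a, b and c are excluded from the candidates
    (an assumed convention, not stated in the article).
    """
    target = [
        xb - xa + xc
        for xa, xb, xc in zip(vectors[a], vectors[b], vectors[c])
    ]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Invented toy vectors for illustration only (not real embeddings).
toy_vectors = {
    "mwanamume": [1.0, 0.0],  # man
    "mwanamke": [1.0, 1.0],   # woman
    "mfalme": [3.0, 0.0],     # king
    "malkia": [3.0, 1.0],     # queen
    "mtoto": [0.0, 3.0],      # child
}

# "mwanamume" is to "mwanamke" as "mfalme" is to ...?
predicted = solve_analogy(toy_vectors, "mwanamume", "mwanamke", "mfalme")
print(predicted)
```

With real embeddings, an analogy question would typically be counted as correct when the predicted word matches the fourth word of the quadruple, and accuracy would be the fraction of questions answered correctly.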

Processing the unannotated Swahili dataset

Processing the Swahili syllabic alphabet

Processing the Swahili word analogy dataset