Abstract

Language modelling using neural networks requires adequate data to guarantee quality word representations, which are important for natural language processing (NLP) tasks. However, African languages, Swahili in particular, have been disadvantaged, and most of them are classified as low-resource languages because of inadequate data for NLP. In this article, we derive and contribute an unannotated Swahili dataset, a Swahili syllabic alphabet and a Swahili word analogy dataset to address the need for language processing resources, especially for low-resource languages. We derive the unannotated Swahili dataset by pre-processing raw Swahili data using a Python script, formulate the syllabic alphabet, and develop the Swahili word analogy dataset based on an existing English dataset. We envisage that the datasets will support not only language models but also other downstream NLP tasks such as part-of-speech tagging, machine translation and sentiment analysis.

Highlights

  • Language modelling using neural networks requires adequate data to guarantee quality word representations, which are important for natural language processing (NLP) tasks

  • We derive and contribute an unannotated Swahili dataset, a Swahili syllabic alphabet and a Swahili word analogy dataset to address the need for language processing resources, especially for low-resource languages

  • We derive the unannotated Swahili dataset by pre-processing raw Swahili data using a Python script, formulate the syllabic alphabet, and develop the Swahili word analogy dataset based on an existing English dataset

Summary

Language modelling

Natural language processing (NLP) involves using computational techniques to learn, understand and produce human language content [7]. Machine learning and deep learning algorithms have been instrumental in NLP [15], with word embeddings playing an important role in the success of language automation systems. These algorithms depend heavily on the availability of data.

Although Swahili is widely used and a lot of speech and text data exists, it is still classified as a low-resource language with limited pre-processed open-access data [2,3,4]. For this reason, NLP research on Swahili has been limited to the restricted annotated Helsinki dataset [8], which was developed by researchers from the University of Helsinki in conjunction with the University of Nairobi.

In a word analogy test (A is to B as C is to D), given the representation vectors of words A, B and C, the vector of D can be derived as X_B - X_A + X_C, where X_i is the word representation vector of word i. This test cannot be applied to Swahili language models because no Swahili analogy dataset exists.

Data description

This section provides an individual description of each dataset in the following subsections

Unannotated Swahili dataset
Swahili syllabic alphabet
Swahili word analogy dataset
Processing the unannotated Swahili dataset
Processing the Swahili syllabic alphabet
Processing the Swahili word analogy dataset
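
The word analogy test mentioned in the summary (deriving D from X_B - X_A + X_C) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' evaluation code: the function names `cosine` and `solve_analogy`, the two-dimensional toy vectors and the choice of cosine similarity with the three query words excluded are all assumptions made for the example, and the Swahili words are illustrative rather than taken from the published dataset.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def solve_analogy(vectors, a, b, c):
    """Return the word whose vector is closest to X_b - X_a + X_c.

    Assumption: the query words a, b and c are excluded from the
    candidates, a common convention in analogy evaluation.
    """
    target = [
        xb - xa + xc
        for xa, xb, xc in zip(vectors[a], vectors[b], vectors[c])
    ]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Toy 2-d vectors invented for illustration; real embeddings would be
# learned from a corpus and have many more dimensions.
toy_vectors = {
    "mfalme": [1.0, 1.0],      # king
    "malkia": [1.0, -1.0],     # queen
    "mwanamume": [0.2, 1.0],   # man
    "mwanamke": [0.2, -1.0],   # woman
    "mtoto": [-1.0, 0.0],      # child
}

# mwanamume is to mfalme as mwanamke is to ...?
answer = solve_analogy(toy_vectors, "mwanamume", "mfalme", "mwanamke")
print(answer)
```

With these toy vectors the target equals the vector assigned to "malkia", so that word is returned. In an evaluation over a full analogy dataset, accuracy would be the fraction of questions for which the returned word matches the expected D.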