Kencorpus: A Kenyan Language Corpus of Swahili, Dholuo and Luhya for Natural Language Processing Tasks

Barack Wanjawa, Lilian Wanzare, Florence Indede, Edward Ombui, Lawrence Muchemi, Owen McOnyango

doi:10.21248/jlcl.36.2023.243

Abstract

Indigenous African languages are categorized as under-served in Natural Language Processing. They therefore experience poor digital inclusivity and information access. The processing challenge with such languages has been how to use machine learning and deep learning models without the requisite data. The Kencorpus project intends to bridge this gap by collecting and storing text and speech data of sufficient quality for data-driven solutions in applications such as machine translation, question answering and transcription in multilingual communities. The Kencorpus dataset is a text and speech corpus for three languages predominantly spoken in Kenya: Swahili, Dholuo and Luhya (three dialects: Lumarachi, Lulogooli and Lubukusu). Data collection was done by researchers who were deployed to various data collection sources such as communities, schools, media, and publishers. The Kencorpus dataset has a collection of 5,594 items: 4,442 texts totalling 5.6 million words and 1,152 speech files totalling 177 hours. Based on this data, other datasets were also developed, such as Part of Speech tagging sets of 50,000 tagged words for Dholuo and 93,000 tagged words for the Luhya dialects. We developed 7,537 Question-Answer pairs from 1,445 Swahili texts and also created a text translation set of 13,400 sentences from Dholuo and Luhya into Swahili. The datasets are useful for downstream machine learning tasks such as model training and translation. Additionally, we developed two proof-of-concept systems: a Kiswahili speech-to-text system and a machine learning system for the Question Answering task. These systems achieved a word error rate of 18.87% for the former and an Exact Match (EM) score of 80% for the latter. These initial results are promising for the usability of Kencorpus by the machine learning community.
Kencorpus is one of the few public domain corpora for these three low-resource languages and provides a basis for learning and for sharing experiences with similar work, especially for low-resource languages. Challenges in developing the corpus included deficiencies in the data sources, difficulties in data cleaning, relatively short project timelines, and the Coronavirus disease (COVID-19) pandemic, which restricted movement and hence the ability to collect the data in a timely manner.
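
The two metrics reported above, word error rate and Exact Match, can be illustrated with a minimal sketch. The abstract does not describe the scoring scripts used in the project, so the function names, the whitespace tokenization and the lowercase normalization below are assumptions made for illustration only.

```python
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are separated by whitespace and compared as-is.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def exact_match(predictions: Sequence[str], gold_answers: Sequence[str]) -> float:
    """Fraction of predictions equal to the gold answer after normalization.

    Assumption: normalization is lowercasing plus collapsing whitespace.
    """
    if len(predictions) != len(gold_answers) or not gold_answers:
        raise ValueError("inputs must be non-empty and of equal length")

    def normalize(text: str) -> str:
        return " ".join(text.lower().split())

    hits = sum(normalize(p) == normalize(g) for p, g in zip(predictions, gold_answers))
    return hits / len(gold_answers)
```

Under these definitions, a word error rate of 18.87% corresponds to roughly 19 word-level edits (substitutions, deletions or insertions) per 100 reference words, and an Exact Match of 80% means that four in five predicted answers equal the gold answer after normalization.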


Journal: Journal for Language Technology and Computational Linguistics
Publication Date: Jun 21, 2023
Citations: 2
License type: CC BY-SA 4.0

