Izindaba-Tindzaba: Machine learning news categorisation for Long and Short Text for isiZulu and Siswati

Andani Madodonga, Matthew Adendorff, Vukosi Marivate

doi:10.55492/dhasa.v4i01.4449

Abstract

Native South African languages are classified as low-resource languages. As such, it is essential to build resources for these languages so that they can benefit from advances in natural language processing. This work creates annotated news datasets for isiZulu and Siswati for the task of news topic classification and presents findings from baseline classification models. Because data for these languages is scarce, the datasets were augmented and oversampled to increase their size and to overcome class imbalance. Four classification models were used: Logistic Regression, Naive Bayes, XGBoost and LSTM. These models were trained on three different word embeddings: Bag-of-Words, TF-IDF and Word2vec. The results showed that XGBoost, Logistic Regression and LSTM trained on Word2vec performed better than the other combinations.
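
The abstract mentions oversampling to address class imbalance and a Bag-of-Words representation but does not give the exact procedure. The following is a minimal sketch of what such a preprocessing step might look like, assuming simple random oversampling with replacement and whitespace tokenisation. The function names `random_oversample` and `bag_of_words` and the toy English tokens are hypothetical and are not taken from the paper.

```python
import random
from collections import Counter


def bag_of_words(texts):
    """Map each text to a vector of token counts over a shared vocabulary."""
    # Assumption: lowercase + whitespace tokenisation; the paper's tokeniser may differ.
    vocab = sorted({tok for t in texts for tok in t.lower().split()})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for t in texts:
        vec = [0] * len(vocab)
        for tok in t.lower().split():
            vec[index[tok]] += 1
        vectors.append(vec)
    return vocab, vectors


def random_oversample(texts, labels, seed=0):
    """Duplicate randomly chosen examples of smaller classes until every
    class has as many examples as the largest class."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in zip(texts, labels):
        by_label.setdefault(label, []).append(text)
    target = max(len(items) for items in by_label.values())
    out_texts, out_labels = [], []
    for label, items in by_label.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


# Placeholder tokens for illustration only, not real isiZulu or Siswati data.
texts = ["sports match goal", "sports team win", "sports goal team", "politics vote"]
labels = ["sport", "sport", "sport", "politics"]

balanced_texts, balanced_labels = random_oversample(texts, labels)
vocab, vectors = bag_of_words(balanced_texts)
class_counts = Counter(balanced_labels)
```

After oversampling, both classes have the same number of examples, and each row of `vectors` could be passed to a classifier such as logistic regression. In practice, oversampling is usually applied to the training split only, so that duplicated examples do not leak into the evaluation data.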

Journal: Journal of the Digital Humanities Association of Southern Africa (DHASA)
Publication Date: Jan 26, 2023
License type: cc-by-sa