The Analysis of the Sepedi-English Code-switched Radio News Corpus

Simon Ramalepe, Thipe I Modipa, Marelie H Davel

doi:10.55492/dhasa.v4i01.4444

Abstract

Code-switching occurs mostly in multilingual countries, where multilingual speakers often switch between languages within a conversation. The lack of large-scale code-switched corpora hampers the development and training of language models that generate code-switched text. In this study, we explore the initial phase of collecting and creating a Sepedi-English code-switched corpus for generating synthetic news. We considered radio news and analysed the frequency of code-switching in read news. We developed and trained a Transformer-based language model on the collected code-switched dataset. The frequency of code-switched data in the dataset was very low, at 1.1%. We complemented our dataset with a news headlines dataset to create a new dataset. Although the frequency remained low, the model obtained an optimal loss of 2.361 with an accuracy of 66%.
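
The abstract does not state how the code-switching frequency was measured. As a rough illustration only, the sketch below assumes a sentence-level definition: the share of sentences that contain at least one token from an English word list. The function name `code_switch_frequency`, the word list, and the toy sentences are hypothetical and are not taken from the corpus or from the authors' method.

```python
def code_switch_frequency(sentences, english_vocab):
    """Return the fraction of sentences with at least one English token.

    Assumption: a sentence counts as code-switched when any of its
    whitespace-separated tokens appears in `english_vocab`. This is an
    illustrative definition, not the one used in the paper.
    """
    if not sentences:
        return 0.0
    switched = 0
    for sentence in sentences:
        # Lowercase each token and strip surrounding punctuation.
        tokens = [t.strip(".,;:!?\"'()").lower() for t in sentence.split()]
        if any(t in english_vocab for t in tokens if t):
            switched += 1
    return switched / len(sentences)


# Toy input (hypothetical, not drawn from the radio news corpus).
toy_sentences = ["ke a leboga", "ke a leboga for the update"]
toy_english = {"for", "the", "update"}
frequency = code_switch_frequency(toy_sentences, toy_english)
```

A word-list lookup is the simplest possible detector. It would miscount words shared between the two languages, so a real analysis would likely need manual annotation or a language identification step.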

Journal: Journal of the Digital Humanities Association of Southern Africa (DHASA)
Publication Date: Jan 26, 2023
License type: cc-by-sa
