Code-switching, the mixing of words or phrases from multiple grammatically distinct languages, introduces semantic and syntactic complexities that complicate automated text classification. Although code-switching is common in informal text-based communication among most bilingual or multilingual users of digital spaces, its use to spread misinformation remains relatively unexplored. In Kenya, for instance, code-switched Swahili-English is prevalent on social media. Our main objective in this paper was to systematically review code-switching, particularly the use of Swahili-English code-switching to spread misinformation on social media in the Kenyan context. Additionally, we aimed to pre-process a Swahili-English code-switched dataset and to develop a misinformation classification model trained on this dataset. We discuss the process we followed to develop the code-switched Swahili-English misinformation classification model. The model was trained and tested using the PolitiKweli dataset, which is the first Swahili-English code-switched dataset curated for misinformation classification. The dataset was collected from the Twitter (now X) social media platform, focusing on text posted during the electioneering period of the 2022 general elections in Kenya. The study experimented with two types of word embeddings, GloVe and FastText. FastText uses character n-gram representations that help generate meaningful vectors for rare and unseen words in the code-switched dataset. We experimented with both classical machine learning algorithms and deep learning algorithms. The Bidirectional Long Short-Term Memory (BiLSTM) network showed the best performance, with an F-score of 0.89. The model was able to classify code-switched Swahili-English political misinformation text as fake, fact or neutral. This study contributes to recent research efforts in developing language models for low-resource languages.
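To illustrate why character n-grams help with rare and unseen code-switched words, the following is a minimal sketch in plain Python and not the study's implementation. It extracts FastText-style character n-grams, using the `<` and `>` word-boundary markers and the 3-to-6 length range that the fastText library uses by default. The function names `char_ngrams` and `ngram_overlap` and the example words are hypothetical and chosen only for illustration.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the set of character n-grams of `word`, FastText-style.

    The word is wrapped in '<' and '>' so that prefixes and suffixes
    are distinguishable from n-grams in the middle of the word.
    """
    wrapped = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.add(wrapped[i:i + n])
    return grams


def ngram_overlap(a, b):
    """Jaccard overlap between the character n-gram sets of two words."""
    grams_a, grams_b = char_ngrams(a), char_ngrams(b)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


if __name__ == "__main__":
    # Hypothetical code-switched forms: Swahili verb prefixes attached
    # to the English stem "vote", plus an unrelated Swahili word.
    for other in ("watavote", "uchaguzi"):
        print("anavote", other, round(ngram_overlap("anavote", other), 3))
```

Because a subword model builds a word's vector from the vectors of its n-grams, a mixed form that never appeared in training can still receive a usable vector when it shares n-grams such as `vot` or `ote>` with words that did appear. A word-level embedding with a fixed vocabulary has no such fallback for out-of-vocabulary words.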