Fake News Identification in Urdu Tweets Using Machine Learning Models

Inam Ullah Khan, Fida Muhammad Khan, Zahid Iqbal

doi:10.62019/abbdm.v4i1.105

Abstract

A growing number of people generate and distribute content online, especially via social media platforms, and this growth is a primary driver of the proliferation of fake information. Fake information can cause controversy and distort people's perspectives, so it needs to be addressed promptly. The goal of this work is to detect false information in Urdu tweets, a difficult task given the language's large user population and particular grammatical difficulties. We present an end-to-end machine learning pipeline that classifies Urdu tweets as legitimate or false. The methodology consists of three key steps: data collection, which involves compiling and annotating a sizable dataset of Urdu tweets; preprocessing, which includes normalizing, tokenizing, removing stop words, and stemming to prepare the data for analysis; and feature extraction, which uses TF-IDF to represent each tweet as a vector of weighted terms. We investigate classical machine learning models (SVM, Random Forest, Logistic Regression, Naive Bayes, and Decision Tree) as well as more sophisticated neural networks (CNNs and RNNs) to find the most effective method for this classification problem. The models are trained and evaluated using accuracy, precision, recall, and the F1 score, and their confusion matrices are examined in detail. Our findings suggest that deep learning models hold considerable promise for detecting inaccurate information in Urdu. This opens the door to further research and to the creation of real-time algorithms for spotting false information. This work contributes to information integrity in Urdu-language content and sheds light on the applicability of machine learning techniques across linguistic contexts.
Using SVM, Random Forest, Logistic Regression, Naive Bayes, and Decision Tree, we achieved accuracies of 74%, 91%, 76%, 78%, and 67%, respectively. The CNN and RNN classifiers reached the highest accuracy levels, at 91% and 99%, respectively. Overall, the deep learning models achieved the highest accuracy, up to 99%, in detecting fake news in Urdu tweets.
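
The preprocessing, TF-IDF, and evaluation steps named in the abstract can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the function names (`preprocess`, `tfidf`, `binary_metrics`), the two-word stop list, the sample tweets, and the choice of "fake" as the positive class are all assumptions made for illustration, and stemming is omitted. The TF-IDF variant shown (relative term frequency times log(N/df)) is one common formulation and may differ from the one used in the paper.

```python
import math
import unicodedata
from collections import Counter

# Hypothetical two-word stop list, for illustration only (not the authors' list).
STOP_WORDS = {"یہ", "ہے"}

def preprocess(text):
    """Normalize to Unicode NFC, tokenize on whitespace, and drop stop words.
    Stemming is omitted in this sketch."""
    tokens = unicodedata.normalize("NFC", text).split()
    return [t for t in tokens if t not in STOP_WORDS]

def tfidf(docs):
    """docs is a list of token lists; returns one {term: weight} dict per document.
    Weight = (count / document length) * log(N / document frequency)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return vectors

def binary_metrics(y_true, y_pred, positive="fake"):
    """Confusion-matrix counts plus accuracy, precision, recall, and F1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "confusion": {"tp": tp, "fp": fp, "fn": fn, "tn": tn},
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Made-up sample tweets and labels, not drawn from the paper's dataset.
tweets = ["یہ خبر جھوٹی ہے", "یہ خبر سچی ہے"]
features = tfidf([preprocess(t) for t in tweets])
scores = binary_metrics(
    ["fake", "fake", "real", "real", "fake"],
    ["fake", "real", "real", "fake", "fake"],
)
print(features)
print(scores)
```

In this sketch, a term that occurs in every document receives a weight of zero, so only the terms that distinguish one tweet from another carry weight. The resulting vectors would then be passed to any of the classifiers listed above.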

Full Text