Abstract

The Internet is used by millions of people daily, many of whom publish news content on social media platforms such as Twitter and Facebook. These platforms have become a major channel for spreading fake news, which poses a significant problem for individuals and society. Fake news is false information written to mislead readers. The text of fake news on these platforms is unstructured and must be preprocessed and converted into a numerical representation before it can be used. Some fake news appears so authentic that even humans struggle to identify it. Automated fake news detection tools based on machine learning methods have therefore become essential. This paper investigates and compares two feature extraction approaches, Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF), combined with N-grams, and three conventional machine learning classifiers: Support Vector Machine (SVM), Naive Bayes (NB), and Decision Tree (DT). In addition, the performance of these models is compared with a fine-tuned BERT transformer model using its own feature representation. The experiments were conducted on a fake and real news dataset. The results demonstrate that the traditional models remain strong candidates: the combination of bigram BoW features with a DT classifier performs best, achieving an accuracy of 99.74%, outperforming existing results and matching the F1 score of BERT on this dataset.
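For readers who want to see what the headline configuration looks like in practice, the following is a minimal sketch of a bigram BoW plus Decision Tree pipeline in scikit-learn. It is an illustration, not the paper's own code: the CSV file name, the "text" and "label" column names, and the train/test split parameters are assumptions, as the abstract does not specify them.

# Minimal sketch of the best-performing configuration reported in the
# abstract: bigram Bag-of-Words features fed to a Decision Tree.
# Assumes a CSV with "text" and "label" columns; file name and columns
# are hypothetical, not taken from the paper.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, f1_score

df = pd.read_csv("fake_and_real_news.csv")  # hypothetical path
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42
)

# ngram_range=(2, 2) produces pure bigram BoW counts; (1, 2) would
# mix unigrams and bigrams instead. Swapping CountVectorizer for
# TfidfVectorizer gives the TF-IDF variant compared in the paper.
model = make_pipeline(
    CountVectorizer(ngram_range=(2, 2)),
    DecisionTreeClassifier(random_state=42),
)
model.fit(X_train, y_train)

preds = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, preds))
print("macro F1:", f1_score(y_test, preds, average="macro"))

Replacing the Decision Tree with sklearn.svm.LinearSVC or sklearn.naive_bayes.MultinomialNB reproduces the other two conventional classifiers the paper evaluates.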
