Abstract

The problem of automatic detection of fake news in social media, e.g., on Twitter, has recently drawn some attention. Although, from a technical perspective, it can be regarded as a straightforward binary classification problem, the major challenge is the collection of large enough training corpora, since manual annotation of tweets as fake or non-fake news is an expensive and tedious endeavor, and recent approaches utilizing distributional semantics require large training corpora. In this paper, we introduce an alternative approach for creating a large-scale dataset for tweet classification with minimal user intervention. The approach relies on weak supervision and automatically collects a large-scale, but very noisy, training dataset comprising hundreds of thousands of tweets. As a weak supervision signal, we label tweets by their source, i.e., as coming from a trustworthy or an untrustworthy source, and train a classifier on this dataset. We then use that classifier for a different classification target, i.e., the classification of fake and non-fake tweets. Although the labels are inaccurate with respect to the new classification target (not every tweet from an untrustworthy source is fake news, and vice versa), we show that the results obtained with this noisy dataset are comparable to those achieved using a manually labeled set of tweets. Moreover, we show that combining the large-scale noisy dataset with a human-labeled one yields better results than either of the two alone.
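To make the weak-supervision setup concrete, the following is a minimal, hypothetical sketch (the toy tweets, TF-IDF features, and logistic regression are illustrative choices, not the paper's actual pipeline): every tweet inherits a noisy label from its source's trustworthiness, a classifier is trained on these labels, and the model is then applied to the different target of classifying individual tweets as fake or non-fake news.

```python
# Minimal sketch of the weak-supervision idea (hypothetical toy data;
# the paper's actual features and learners are outlined in the sections below).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Weakly labeled training data: each tweet inherits its source's label.
# 1 = tweet from an untrustworthy source, 0 = tweet from a trustworthy source.
weak_tweets = [
    "SHOCKING truth the media won't tell you about the election",
    "Senate passes budget bill after lengthy debate",
    "Miracle cure discovered, doctors hate this one trick",
    "Central bank leaves interest rates unchanged",
]
weak_labels = [1, 0, 1, 0]  # noisy: derived from the source, not the tweet itself

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(weak_tweets, weak_labels)

# The classifier is then reused for a different target: labeling individual
# tweets as fake vs. non-fake news (here, on a small hand-labeled set).
eval_tweets = [
    "Aliens endorse candidate, insiders claim",
    "Parliament approves new transport funding",
]
print(model.predict(eval_tweets))
```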

Highlights

  • In recent years, fake news shared on social media has become a widely recognized topic [1,2,3]

  • Since the original training set was labeled by source, not by tweet, the first setting evaluates how well the approach performs on the task of identifying fake news sources, whereas the second setting evaluates how well the approach performs on the task of identifying fake news tweets—which was the overall goal of this work

  • It is important to note that the fake news tweets come from sources that have not been used in the training dataset (see the source-disjoint split sketch after this list)
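
A hypothetical sketch of such a source-disjoint evaluation split (GroupShuffleSplit and the toy arrays are assumptions; the paper does not prescribe a specific tool), ensuring that no source contributes tweets to both the training and the test side:

```python
# Source-disjoint split: tweets from the same source never appear in both
# train and test, so the model cannot simply memorize sources.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

tweets = np.array(["t1", "t2", "t3", "t4", "t5", "t6"])  # toy placeholder tweets
labels = np.array([1, 1, 0, 0, 1, 0])                    # noisy source-derived labels
sources = np.array(["A", "A", "B", "B", "C", "C"])       # the source of each tweet

splitter = GroupShuffleSplit(n_splits=1, test_size=0.34, random_state=0)
train_idx, test_idx = next(splitter.split(tweets, labels, groups=sources))

# Verify that no source appears on both sides of the split.
assert set(sources[train_idx]).isdisjoint(sources[test_idx])
```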

Summary

Introduction

Fake news shared on social media has become a widely recognized topic [1,2,3]. In the aftermath of the 2021 election, fake news campaigns claiming election fraud were detected [6]. These examples show that methods for identifying fake news are a relevant research topic. Classification of tweets has been used for different use cases, most prominently sentiment analysis [9], but also classification by type (e.g., news, meme, etc.) [10] or by relevance for a given topic [11]. While sentiment or topic can be labeled comparatively easily, even by less experienced crowd workers [9,12], labeling a news tweet as fake or non-fake requires considerably more research and may be a non-trivial task. A preliminary study underlying this paper was published at the IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM) [15].

Related Work
Datasets
Large-Scale Training Dataset
Small-Scale Evaluation Dataset
Evaluation Scenarios
Approach
User-Level Features
Tweet-Level Features
Text Features
Topic Features
Sentiment Features
Feature Scaling and Selection
Learning Algorithms and Parameter Optimization
Evaluation
Setting 1
Setting 2
Comparison to Training on Manually Labeled Data
Feature Importance
Conclusions and Outlook