Abstract

Disinformation in the form of news articles, also called fake news, is used by multiple actors for nefarious purposes, such as gaining political advantages. A key component for fake news detection is the ability to find similar articles in a large documents corpus, for tracking narrative changes and identifying the root source (patient zero) of a particular piece of information. This paper presents new techniques based on textual and semantic similarity that were adapted for achieving this goal on large datasets of news articles. The aim is to determine which of the implemented text similarity techniques is more suitable for this task. For text similarity, a Locality-Sensitive Hashing is applied on n-grams extracted from text to produce representations that are further indexed to facilitate the quick discovery of similar articles. The semantic textual similarity technique is based on sentence embeddings from pre-trained language models, such as BERT, and Named Entity Recognition. The proposed techniques are evaluated on a collection of Romanian articles to determine their performance in terms of quality of results and scalability. The presented techniques produce competitive results. The experimental results show that the proposed semantic textual similarity technique is better at identifying similar text documents, while the Locality-Sensitive Hashing text similarity technique outperforms it in terms of execution time and scalability. Even if they were evaluated only on Romanian texts and some of them are based on pre-trained models for the Romanian language, the methods that are the basis of these techniques allow their extension to other languages, with few to no changes, provided that there are pre-trained models for other languages as well. As for a cross-lingual setup, more changes are needed along with tests to demonstrate this capability. Based on the obtained results, one may conclude that the presented techniques are suitable to be integrated into a decentralized anti-disinformation platform for fact-checking and trust assessment.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call