Abstract
The rapid dissemination of information on digital platforms has been accompanied by the emergence of information pollution, including misinformation, disinformation, fake news, and various types of propaganda. Information pollution has become a serious threat to the online digital world and poses several challenges to social media platforms and governments around the world. In this article, we propose Propaganda Spotting in Online Urdu Language (ProSOUL), a framework to identify the content and sources of propaganda spread in the Urdu language. First, we develop a labelled dataset of 11,574 Urdu news items to train machine learning classifiers. Next, we develop a Linguistic Inquiry and Word Count (LIWC) dictionary to extract psycho-linguistic features from Urdu text. We evaluate the performance of different classifiers while varying the n-gram, News Landscape (NELA), Word2Vec, and Bidirectional Encoder Representations from Transformers (BERT) features. Our results show that the combination of NELA, word n-gram, and character n-gram features performs best, with 0.91 accuracy for Urdu text classification. In addition, Word2Vec embeddings outperform BERT features in the classification of Urdu text, with 0.87 accuracy. Moreover, we develop and classify large-scale Urdu content repositories to identify web sources spreading propaganda. Our results show that the ProSOUL framework performs better at propaganda detection on online Urdu news content than on general web content. To the best of our knowledge, this is the first study on the detection of propaganda content in the Urdu language.
Highlights
Recent developments in artificial intelligence, big data, and natural language generation are a double-edged sword
Our evaluation shows that n-gram features perform poorly when classifying data from unseen sources, compared with News Landscape (NELA), Bidirectional Encoder Representations from Transformers (BERT), and Word2Vec features
A detailed analysis of classifiers with n-gram, NELA, Word2Vec, and BERT features shows the best performance with 0.91 accuracy for the combination of word n-gram, character n-gram, and NELA features
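To make the idea of combining word n-gram and character n-gram features concrete, the following is a minimal sketch and not the authors' implementation. It assumes whitespace tokenization, raw counts as feature values, and n-gram sizes chosen only for illustration. The function names and the sample sentence are hypothetical, and NELA features are omitted entirely.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return contiguous word n-grams, splitting on whitespace (an assumption)."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Return contiguous character n-grams, spaces included."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def combined_features(text, word_sizes=(1, 2), char_sizes=(2, 3)):
    """Count word and character n-grams in a single feature dictionary.

    Keys are prefixed ("w1:", "c3:", ...) so that a word n-gram and a
    character n-gram with the same surface form remain separate features.
    """
    features = Counter()
    for n in word_sizes:
        features.update(f"w{n}:{gram}" for gram in word_ngrams(text, n))
    for n in char_sizes:
        features.update(f"c{n}:{gram}" for gram in char_ngrams(text, n))
    return features


# Hypothetical sample sentence, used only to exercise the functions.
sample = "یہ ایک خبر ہے"
features = combined_features(sample)
```

A dictionary like `features` could then be vectorized and passed to any standard classifier. In a fuller pipeline, additional feature groups would be concatenated with these counts in the same way.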
Summary
Recent developments in artificial intelligence, big data, and natural language generation are a double-edged sword. Applications such as text summarization [1], chatbots [2], and automated journalism [3] assist humans. However, the same technologies have also become effective tools for the generation and dissemination of misinformation. The growth of misinformation in online content and its amplification by social media platforms pose several critical challenges to society. Fake news and various propaganda techniques are serious threats to democracy [4], journalism [5], health [6], economy [7], and climate change [8]. Propaganda is an expression of opinion or action by individuals or groups deliberately designed to influence the opinions or actions of other individuals or groups concerning predetermined ends [9].