Predicting Rogue Content and Arabic Spammers on Twitter

Adel R Alharbi, Amer Aljaedi

doi:10.3390/fi11110229

Journal: Future Internet
Publication Date: Oct 30, 2019
Citations: 11
License type: CC BY 4.0

Affiliation: University of Tabuk


Abstract


Twitter is one of the most popular online social networks for spreading messages and propaganda in the Arab region. Spammers now create rogue accounts that use Arabic tweets to distribute adult content prohibited by Arab norms and culture. Arab governments face a major challenge in detecting these accounts. Researchers have studied English spam on online social networks extensively, whereas social network spam in other languages has, to date, been completely ignored. In our previous study, we estimated that rogue and spam content accounted for approximately three quarters of all content with Arabic trending hashtags in Saudi Arabia. This alarming rate, supported by independent concurrent estimates, highlights the urgent need to develop adaptive spam detection methods. In this work, we collected a pure data set of Arabic tweets produced by spam accounts. We applied lightweight feature engineering based on rogue content and user profiles. The 47 generated features were analyzed, and the best features were selected. Our performance results show that the random forest classification algorithm with 16 features performs best, with accuracy above 90%.

Highlights

Twitter is a platform that allows users to compose messages of 140 characters or less
Using the random forest classifier, we compared different numbers of variables randomly sampled as candidates at each split and found that 8 variables performed best
The analysis reveals that advanced machine learning algorithms, such as random forest (RF), decision tree (DT), and naïve Bayes (NB), are more effective than simple approaches such as bag-of-words
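
The "variables randomly sampled as candidates at each split" setting can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the parameter name n_candidates, the Gini criterion, and the toy data are all assumptions made for illustration. The point is only that a random forest node looks at a random subset of the features, not all of them, when it chooses a split.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels, n_candidates, rng):
    """Choose the best (feature, threshold) among a random subset of features.

    rows         -- list of equal-length numeric feature vectors
    labels       -- one class label per row
    n_candidates -- number of features sampled at this split (hypothetical name)
    rng          -- random.Random instance, so the sampling is reproducible
    Returns (candidates, best), where best is (weighted_gini, feature, threshold)
    or None if no sampled feature can separate the rows.
    """
    n_features = len(rows[0])
    candidates = rng.sample(range(n_features), n_candidates)
    best = None
    for f in candidates:
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue  # a valid split sends rows to both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, threshold)
    return candidates, best


# Toy data (invented): feature 0 separates the two classes, the others do not.
rows = [[0.1, 5.0, 1.0], [0.2, 1.0, 1.0], [0.9, 5.0, 1.0], [0.8, 1.0, 1.0]]
labels = ["ham", "ham", "spam", "spam"]
candidates, best = best_split(rows, labels, n_candidates=3, rng=random.Random(0))
```

With fewer candidates than features, a node may not see the most informative feature at all, which is what decorrelates the trees in a forest. The value 8 reported above would correspond to n_candidates=8 in this sketch.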

Summary

Introduction

Twitter is a platform that allows users to compose messages of 140 characters or less. These messages are known as tweets and can include text, short videos, images, and hyperlinks. Twitter usernames start with the prefix @. Users of Twitter build their social networks by interacting with fans and followers. Tweets generated by users appear on their homepage and on the timelines of their followers, and they can be discovered through Twitter’s search engine. A tweet is often relayed repeatedly, together with any @-prefixed username it contains. Hashtags are preceded by the hash (#) symbol and can also be found through Twitter’s search engine [1].
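
As a rough illustration of the @ and # conventions described above, the following Python sketch pulls usernames and hashtags out of a tweet's text. The patterns are assumptions, not Twitter's official tokenization rules: handles are taken to be 1 to 15 ASCII letters, digits, or underscores, and a hashtag is taken to be # followed by Unicode word characters, which covers Arabic letters. The function name and the sample tweet are invented for the example.

```python
import re

# Assumption: a handle is '@' plus 1-15 ASCII letters, digits, or underscores,
# and is not glued to a preceding word character (which would suggest an email).
MENTION_RE = re.compile(r"(?<![\w@])@([A-Za-z0-9_]{1,15})")

# Assumption: a hashtag is '#' plus one or more Unicode word characters.
# In Python 3, \w on str patterns matches Arabic letters as well as Latin ones.
HASHTAG_RE = re.compile(r"(?<!\w)#(\w+)")


def extract_entities(tweet):
    """Return the mentions and hashtags found in a tweet, without prefixes."""
    return {
        "mentions": MENTION_RE.findall(tweet),
        "hashtags": HASHTAG_RE.findall(tweet),
    }


# Invented sample tweet mixing Latin and Arabic hashtags.
sample = "RT @example_user: hello #Riyadh #السعودية"
entities = extract_entities(sample)
```

Counts derived from such entities (for example, the number of hashtags per tweet) are the kind of lightweight content feature a spam classifier could use, although the specific features used in the paper are not listed here.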
