Implementation of Ensemble-Based Prediction Model for Detecting Sybil Accounts in an OSN

Priyanka Roy, Manu Sood

doi:10.1007/978-981-15-5148-2_62

Abstract

Online Social Networks (OSNs) are the leading platforms in use today for a variety of social interactions, generally aimed at fulfilling the specific needs of different strata of users. Normally, a user is allowed to join these social networks with little or no prior verification, which leads to the coexistence of fake entities with malicious intentions on these social networking websites. A specific category of such accounts is known as Sybil accounts, where a malicious user, pretending to be an honest user, creates multiple fake identities to manipulate or harm honest users, creating the illusion among the real users of the OSN that these identities are genuine. In the absence of stringent control mechanisms, it is difficult to identify and remove these malicious accounts. However, because every interaction on a social media website leaves a digital trace, and the huge number of such interactions every day culminates in huge datasets, it is possible to use Machine Learning (ML) techniques to build prediction models for identifying these Sybil accounts. This paper is one such attempt, in which we use ML techniques to build prediction models that predict the presence of Sybil accounts in Twitter datasets. After preprocessing the data in these datasets, we select an optimal set of features using one filter method, namely Correlation with Heatmap, and two wrapper methods, namely Recursive Feature Elimination (RFE) and Recursive Feature Elimination with Cross-Validation (RFE-CV). Then, using 8 classifiers (SVM, NN, LR, DT, RF, NB, GPC, and KNN) to classify the accounts in the datasets, we conclude that the Decision Tree classifier gives the best prediction performance among all these classifiers. Lastly, we use an ensemble of 6 classifiers (SVM, NN, LR, DT, RF, and KNN) with Bagging (max voting) in an attempt to achieve better results. However, we conclude that, owing to the inclusion of weak learners such as SVM, NN, and GPC in the ensemble, DT gives the best possible prediction outcomes.

Keywords: Data preprocessing; Feature selection; Classifier; Ensemble of classifiers; Bagging; Max voting; Sybil attack
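
The max-voting step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `max_vote`, the label encoding (1 = Sybil, 0 = genuine), the tie-breaking rule, and the example predictions are all assumptions made for illustration. Each classifier contributes one predicted label per account, and the ensemble outputs the label that most classifiers agree on.

```python
from collections import Counter


def max_vote(predictions):
    """Combine per-classifier predictions by majority (max) voting.

    predictions: a list with one entry per classifier; each entry is a
    list of labels, one per account. All entries must have equal length.
    Ties go to the label seen first in classifier order (an assumption;
    the source does not state a tie-breaking rule).
    """
    if not predictions:
        raise ValueError("need predictions from at least one classifier")
    n_accounts = len(predictions[0])
    if any(len(p) != n_accounts for p in predictions):
        raise ValueError("every classifier must label every account")

    combined = []
    # zip(*predictions) yields, for each account, the votes of all classifiers.
    for votes in zip(*predictions):
        winner, _count = Counter(votes).most_common(1)[0]
        combined.append(winner)
    return combined


# Hypothetical labels for four accounts (1 = Sybil, 0 = genuine).
example_predictions = [
    [1, 0, 1, 1],  # e.g. a decision tree
    [1, 0, 0, 1],  # e.g. a random forest
    [0, 0, 1, 1],  # e.g. a k-nearest-neighbours model
]
ensemble_labels = max_vote(example_predictions)
print(ensemble_labels)
```

With an odd number of binary classifiers, as in this example, ties cannot occur; with an even number (such as the six classifiers named in the abstract), a tie-breaking rule like the one assumed here is needed.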

Full Text