Users Awareness Prediction of Cyber Security Aspects in Twitter Using Machine Learning Algorithms

Muneer Bani Yassein, Mohammad Shatnawi, Omar Alomari

doi:10.15866/irecap.v11i6.20725

Abstract

Social media websites hold a huge amount of Personally Identifiable Information for a tremendous number of users. Network security measures play a vital role in protecting such sensitive information. Unfortunately, such measures are plagued with vulnerabilities at all levels, which are destined to be exploited, especially when users lack awareness of the many different aspects of cyber security. Security is only as strong as its weakest link, and often that weakest link is an oblivious user. Therefore, it is in users' best interest to become aware of the various aspects of cyber security in order to keep their sensitive data on social media out of attackers' reach. These aspects include privacy, security, social engineering, and different types of network attacks. In this work, a dataset of tweets related to these four aspects has been collected from Twitter for analysis. The aim is to measure Twitter users' awareness of these aspects and to leverage machine learning in developing a set of recommendations for tuning privacy and network controls. Such recommendations can be utilized by existing and new users alike. To this end, a new model has been built by applying pre-processing techniques, then extracting features from the dataset using two popular techniques, and finally feeding them into four popular machine learning models: Multinomial Naïve Bayes, Support Vector Machine, Decision Tree, and Logistic Regression. A two-level classification is used: general category and specific category. The general category consists of five labels, i.e., privacy, security, social engineering, network attacks, and other. In the specific category, each of the general category labels except "other" is further classified into sub-labels. For general category classification, the model with the Logistic Regression algorithm has given the best results in four evaluation metrics compared with the others.
For the specific category, the results are as follows. Decision Tree has achieved the highest performance in the privacy and security aspects. Support Vector Machine has given the best results in the social engineering aspect. Both Logistic Regression and Multinomial Naïve Bayes have outperformed the other algorithms for the network attacks aspect. Finally, some recommendations are presented, such as avoiding sharing private information on social media websites, using strong passwords, using different passwords for different social media accounts, and using secure HTTP (HTTPS) instead of plain HTTP to browse social networking websites.
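
The two-level scheme described in the abstract (a general label first, then a sub-label within that general label) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a hand-written Multinomial Naïve Bayes over raw word counts in place of the paper's two feature-extraction techniques, which the abstract does not name. The example tweets, the sub-label names, and the helpers `TinyMultinomialNB`, `train_two_level`, and `classify` are all hypothetical and exist only for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for real pre-processing."""
    return text.lower().split()


class TinyMultinomialNB:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior plus smoothed log likelihood of each known token.
            score = math.log(n_docs / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for tok in tokenize(text):
                if tok in self.vocab:
                    score += math.log((self.word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up rows: (text, general label, hypothetical sub-label or None).
TOY = [
    ("hide my profile photo from strangers", "privacy", "profile visibility"),
    ("stop apps tracking my location", "privacy", "location sharing"),
    ("use a strong password for every account", "security", "passwords"),
    ("enable two factor login on my account", "security", "authentication"),
    ("great football match tonight", "other", None),
    ("coffee with friends this morning", "other", None),
]


def train_two_level(rows):
    """Train one general classifier plus one sub-classifier per general label."""
    general = TinyMultinomialNB().fit([r[0] for r in rows], [r[1] for r in rows])
    specific = {}
    for g in {r[1] for r in rows if r[2] is not None}:
        subset = [r for r in rows if r[1] == g]
        specific[g] = TinyMultinomialNB().fit(
            [r[0] for r in subset], [r[2] for r in subset]
        )
    return general, specific


def classify(text, general, specific):
    """Return (general label, sub-label); "other" has no sub-classifier."""
    g = general.predict(text)
    s = specific[g].predict(text) if g in specific else None
    return g, s


general, specific = train_two_level(TOY)
print(classify("strong password", general, specific))
```

Training a separate sub-classifier per general label mirrors the abstract's statement that every general label except "other" is further divided into sub-labels. With real data, any of the four algorithms named in the abstract could take the place of the toy classifier at either level.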
