Email-Based Cyberstalking Detection On Textual Data Using Multi-Model Soft Voting Technique Of Machine Learning Approach

Arvind Kumar Gautam, Abhishek Bansal

doi:10.1080/08874417.2022.2155267

Abstract

Internet applications are now used by large numbers of people for many purposes and have become a basic, habitual part of modern life. Like social media, e-mail is widely used by people of all kinds for personal and official communication. This widespread use has also given rise to various cybercrimes, including cyberstalking, in which e-mail is used to harass the victim. Cyberstalkers target victims through several content-wise and intent-wise approaches: spamming, phishing, spoofing, malicious and defamatory e-mails, e-mail bombing, and non-spam e-mails containing sexism, racism, or threats, and ultimately attempts to hack the victim’s account. This paper proposes an email-based cyberstalking detection (EBCD) model for automatic cyberstalking detection on the textual data of e-mails using the multi-model soft voting technique of the machine learning approach. First, experiments were performed to train, test, and validate all classifiers of three model sets on three labeled datasets. Dataset D1 contains spam, fraudulent, and phishing e-mail subjects; dataset D2 contains spam e-mail body text; and dataset D3 contains harassment-related data. The trained, tested, and validated classifiers of all model sets were then applied in combination to automatically classify unlabeled e-mails from the user’s mailbox using the multi-model soft voting technique. The proposed EBCD model classifies e-mails from the user’s mailbox into cyberstalking e-mails, suspicious e-mails (spam and fraudulent), and normal e-mails. Each model set of the EBCD model uses several classifiers: support vector machine, random forest, naïve Bayes, logistic regression, and soft voting.
The final decision in classifying e-mails from the user’s mailbox was made by the soft voting technique of each model set. The TF-IDF feature extraction method was used with all of the applied machine learning model sets to obtain feature vectors from the data. Experimental results show that soft voting not only improves e-mail classification performance but also helps avoid misclassification. Overall, soft voting outperformed the other classifiers, although the support vector machine also performed notably well. In the experiments, soft voting obtained accuracies of 97.7%, 97.7%, and 98.9%; precisions of 97%, 98.3%, and 98.6%; recalls of 98.3%, 96.5%, and 99.1%; F-scores of 97.6%, 97.4%, and 98.9%; and AUCs of 99.4%, 99.7%, and 99.9% on datasets D1, D2, and D3, respectively. The average performance of soft voting across the model sets on classified e-mails from the user’s mailbox was also notable, with an accuracy of 96.3%, precision of 98.1%, recall of 94%, F-score of 95.9%, and AUC of 96.8%.
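
To illustrate the general idea of soft voting mentioned above, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names `soft_vote` and `predict`, the two-class label set, and the per-classifier probabilities are all hypothetical and chosen only for illustration. The sketch assumes each base classifier outputs a probability for every class, and the ensemble averages those probabilities (optionally weighted) before picking the most probable class.

```python
from typing import List, Optional, Sequence, Tuple


def soft_vote(probas: Sequence[Sequence[float]],
              weights: Optional[Sequence[float]] = None) -> List[float]:
    """Return the (weighted) mean class-probability vector over all classifiers."""
    if not probas:
        raise ValueError("at least one classifier output is required")
    n_classes = len(probas[0])
    if weights is None:
        weights = [1.0] * len(probas)  # equal weight for every classifier
    if len(weights) != len(probas):
        raise ValueError("one weight per classifier is required")
    total = float(sum(weights))
    averaged = [0.0] * n_classes
    for proba, weight in zip(probas, weights):
        if len(proba) != n_classes:
            raise ValueError("all classifiers must score the same classes")
        for k in range(n_classes):
            averaged[k] += weight * proba[k] / total
    return averaged


def predict(probas: Sequence[Sequence[float]],
            labels: Sequence[str],
            weights: Optional[Sequence[float]] = None) -> Tuple[str, List[float]]:
    """Pick the label with the highest averaged probability."""
    averaged = soft_vote(probas, weights)
    best = max(range(len(averaged)), key=averaged.__getitem__)
    return labels[best], averaged


# Hypothetical outputs for ONE e-mail; the numbers are invented, not taken from the paper.
LABELS = ["normal", "suspicious"]
example_probas = [
    [0.45, 0.55],  # support vector machine
    [0.40, 0.60],  # random forest
    [0.90, 0.10],  # naive Bayes
    [0.48, 0.52],  # logistic regression
]

label, averaged = predict(example_probas, LABELS)
print(label, averaged)
```

In this invented example, three of the four classifiers lean slightly toward "suspicious", so a majority (hard) vote would choose that class, while the averaged probabilities favor "normal" because one classifier is highly confident. This is the kind of situation where soft voting can differ from hard voting. Libraries such as scikit-learn offer the same idea through `VotingClassifier(voting="soft")`; whether the paper used that class is not stated in the abstract.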

Full Text