Abstract

Phishing is one of the most damaging emerging forms of cyber-attack. Email is now widely accepted as a means of written correspondence, and phishing emails are emails intended to extract sensitive and confidential information from the receiver. This paper has two aims: first, to analyze emails and extract their phishing features, and second, to reduce the complex dataset to a lower-dimensional feature set. The feature reduction process is guided by classification and prediction accuracy. The training and testing dataset is a collection of 1500 data tuples from the SpamAssassin corpus. The validation dataset is built from emails retrieved from Gmail users. In total, 2000 emails are used to train, test, and validate the two class labels, PHISHING and HAM: 1000 emails for training and 500 for testing, with the remaining emails used for validation. First, the dataset is preprocessed using HTML parsing, data cleaning, stemming, stop word elimination, and tokenization. The features are then extracted by reading each email in the dataset iteratively. Classifier ensemble strategies have gained the attention of many researchers in the machine learning community in recent years. Three machine learning classification algorithms are applied to predict PHISHING and HAM emails: decision trees (J48), random forest, and logistic regression. The random forest algorithm was found to separate PHISHING and HAM emails best, with a classifier precision of 99%. It fits best with a set of 15 features, for which the training and validation accuracies are 95.6% and 99.4%, respectively.
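
The preprocessing stages named in the abstract can be illustrated with a minimal Python sketch. Several things here are assumptions and are not taken from the paper: the stop-word list, the crude suffix-stripping rule (a stand-in for a real stemmer such as Porter's), the ordering of the stages, and the hyperlink count used as an example feature. The paper's actual 15 features are not listed in the abstract.

```python
import re
from html.parser import HTMLParser

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "to", "your", "is", "and", "of", "please", "here"}


class _TextExtractor(HTMLParser):
    """Collects text nodes and counts <a> tags while parsing HTML."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.link_count = 0  # hypothetical example feature, not from the paper

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_count += 1

    def handle_data(self, data):
        self.chunks.append(data)


def crude_stem(token):
    """Strip a few common suffixes; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(raw_email):
    """Return (stemmed_tokens, link_count) for one raw HTML email body."""
    parser = _TextExtractor()
    parser.feed(raw_email)
    parser.close()
    text = " ".join(parser.chunks).lower()               # HTML parsing
    tokens = re.findall(r"[a-z]+", text)                 # cleaning and tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop word elimination
    return [crude_stem(t) for t in tokens], parser.link_count  # stemming


sample = '<p>Please verify your account <a href="http://example.test">here</a></p>'
tokens, links = preprocess(sample)
```

In a fuller pipeline, the resulting tokens and counts would be turned into a fixed-length feature vector per email before being passed to a classifier such as a random forest.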
