Machine Learning and Deep Learning for Phishing Email Classification using One-Hot Encoding

Sikha Bagui,Debarghya Nandi,Robert Jamie White,Subhash Bagui

doi:10.3844/jcssp.2021.610.623

Open Access

https://doi.org/10.3844/jcssp.2021.610.623

Journal: Journal of Computer Science
Publication Date: Jul 1, 2021
Citations: 21
License type: cc-by

Affiliation: University of West Florida

Abstract

Representation of text is a significant task in Natural Language Processing (NLP), and in recent years Deep Learning (DL) and Machine Learning (ML) have been widely used in various NLP tasks such as topic classification, sentiment analysis and language translation. Until very recently, little work had been devoted to semantic analysis in phishing detection or phishing email detection. The novelty of this study is in using deep semantic analysis to capture inherent characteristics of the text body. One-hot encoding was used with DL and ML techniques to classify emails as phishing or non-phishing. A comparison of various parameters and hyperparameters was performed for DL. The results of various ML models (Naïve Bayes, SVM and Decision Tree) as well as DL models (Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM)) were presented. The DL models performed better than the ML models in terms of accuracy, but the ML models performed better than the DL models in terms of computation time. CNN with Word Embedding performed the best in terms of accuracy (96.34%), demonstrating the effectiveness of semantic analysis in phishing email detection.
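
As an illustration of the encoding step named in the abstract, the following is a minimal sketch of word-level one-hot encoding. It is not the authors' pipeline: the whitespace tokenizer, the toy emails, the zero-vector handling of unknown words and the helper names (build_vocab, one_hot, encode_email) are all assumptions made for this example.

```python
from typing import Dict, List


def build_vocab(texts: List[str]) -> Dict[str, int]:
    """Assign each distinct lowercase token an index, in first-seen order."""
    vocab: Dict[str, int] = {}
    for text in texts:
        for token in text.lower().split():  # assumption: whitespace tokenizer
            vocab.setdefault(token, len(vocab))
    return vocab


def one_hot(token: str, vocab: Dict[str, int]) -> List[int]:
    """Vector with a single 1 at the token's index; all zeros if unknown."""
    vec = [0] * len(vocab)
    idx = vocab.get(token.lower())
    if idx is not None:
        vec[idx] = 1
    return vec


def encode_email(text: str, vocab: Dict[str, int], window: int) -> List[List[int]]:
    """Encode the first `window` tokens, padding short emails with zero rows."""
    tokens = text.lower().split()[:window]
    rows = [one_hot(t, vocab) for t in tokens]
    rows += [[0] * len(vocab) for _ in range(window - len(rows))]
    return rows


# Hypothetical toy corpus, only for demonstration.
emails = ["verify your account now", "meeting notes attached"]
vocab = build_vocab(emails)
matrix = encode_email("verify your password", vocab, window=4)
```

Here `matrix` has one row per position in the window and one column per vocabulary word. A fixed-size matrix of this kind is what a downstream classifier would consume; the study's actual preprocessing may differ.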

Highlights

Phishing email attacks are intelligently crafted social engineering email attacks in which victims are conned by email to websites that impersonate legitimate sites
Proper estimation of hyperparameters is the key to optimizing any Deep Learning (DL) model
From an analysis of the hyperparameters, for the Convolutional Neural Networks (CNN) model, the best model was obtained for a filter size of 7, context window of 100, embedding window of 80 and pooling size of 4
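
To make the reported CNN hyperparameters more concrete, the sketch below works through the sequence lengths they would imply for a single one-dimensional convolution followed by max pooling. Several points are assumptions not stated in the highlights: that the "embedding window" is the per-token embedding dimension, that the convolution uses stride 1 with no padding, and that pooling is non-overlapping. The function names are hypothetical.

```python
def conv1d_output_length(input_length: int, filter_size: int, stride: int = 1) -> int:
    """Output length of a 1D convolution with no padding (assumed)."""
    if filter_size > input_length:
        raise ValueError("filter is longer than the input sequence")
    return (input_length - filter_size) // stride + 1


def pool1d_output_length(input_length: int, pool_size: int) -> int:
    """Output length of non-overlapping 1D pooling (stride == pool_size, assumed)."""
    if pool_size > input_length:
        raise ValueError("pool is longer than the input sequence")
    return (input_length - pool_size) // pool_size + 1


# Values reported in the highlights.
context_window = 100   # tokens per email
filter_size = 7        # tokens spanned by each filter
embedding_size = 80    # assumption: dimension of each token embedding
pool_size = 4

after_conv = conv1d_output_length(context_window, filter_size)
after_pool = pool1d_output_length(after_conv, pool_size)

# Weights in one filter: one per (position, embedding dimension) pair, plus a bias.
weights_per_filter = filter_size * embedding_size + 1
```

Under these assumptions, each filter slides over 94 positions, pooling reduces that to 23 values per filter, and each filter holds 561 trainable weights. Different padding or stride choices would change these numbers.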

Summary

Introduction

Phishing email attacks are intelligently crafted social engineering email attacks in which victims are conned by email to websites that impersonate legitimate sites. An estimated 269 billion emails are sent every day (Danny, 2020), with about one in every 2,000 being a phishing email, totaling roughly 135 million phishing attacks attempted every day. Since phishing attacks affect millions of internet users (individuals as well as companies) and since the Anti-Phishing Working Group (APWG) is reporting a continuous increase in unique phishing sites, it is becoming extremely important to seek ways to secure ourselves. Existing defense mechanisms need to be greatly improved (Behdad et al., 2012). Behdad et al. (2012) point out that improving defense mechanisms is not enough: systems should be forward-looking and intelligent so that they can identify fraudulent activities and prevent them from occurring.

Methods

Results

Conclusion