The Answer is in the Text: Multi-Stage Methods for Phishing Detection Based on Feature Engineering

Eder Souza Gualberto, Claudio Gottschalg Duque, Rafael Timoteo De Sousa, Joao Paulo Carvalho Lustosa Da Costa, Thiago Pereira De Brito Vieira

doi:10.1109/access.2020.3043396

Open Access

https://doi.org/10.1109/access.2020.3043396

Abstract

A phishing attack is a threat based on fraudulent communication, usually by e-mail, in which cybercriminals impersonate a trusted person or organization to lure and coax a target. Phishing detection approaches that obtain highly representational features from the text of these e-mails are a suitable strategy to counter these threats, since these features can be used to train machine learning algorithms and thus generate models able to classify mail samples as phishing or legitimate messages. This paper proposes a multi-stage approach to detect phishing e-mail attacks using natural language processing and machine learning. The proposed multi-stage approach consists of feature engineering within natural language processing, lemmatization, feature selection, feature extraction, improved learning techniques for resampling and cross-validation, and the configuration of hyperparameters. We present two methods of the proposed approach: the first exploits the Chi-Square statistics and the Mutual Information to improve the dimensionality reduction, while the second associates Principal Component Analysis (PCA) and Latent Semantic Analysis (LSA). Both methods handle the problems of the “curse of dimensionality”, the sparsity, and the amount of information that must be obtained from the context in the Vector Space Model (VSM) representation. These methods yield reduced feature sets that, combined with the XGBoost and Random Forest machine learning algorithms, lead to an F1-measure of 100% in validation tests with the SpamAssassin Public Corpus and the Nazario Phishing Corpus datasets. Even considering just the text in e-mail bodies, the proposed multi-stage phishing detection approach outperforms state-of-the-art schemes for an accredited data set, requiring a much smaller number of features and presenting lower computational cost.
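To make the Chi-Square feature selection step mentioned in the abstract more concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the toy e-mails, the whitespace tokenization (no lemmatization), the binary term-presence representation, and the helper names `chi_square_scores` and `select_top_k` are all assumptions made for illustration. It scores each term with the standard chi-square statistic for a 2x2 term-presence/class table and keeps the highest-scoring terms as a reduced feature set.

```python
from collections import Counter

# Toy labelled corpus (hypothetical examples, not from the paper's datasets).
# Label 1 = phishing, 0 = legitimate.
EMAILS = [
    ("verify your account password now", 1),
    ("urgent verify your bank account", 1),
    ("your account is suspended click to verify", 1),
    ("meeting notes for project attached", 0),
    ("lunch tomorrow with the team", 0),
    ("please review the attached agenda", 0),
]


def chi_square_scores(samples):
    """Return {term: chi-square statistic} using term presence vs. class."""
    n = len(samples)
    n_pos = sum(label for _, label in samples)
    doc_freq = Counter()  # number of documents containing the term
    pos_freq = Counter()  # number of phishing documents containing the term
    for text, label in samples:
        for term in set(text.split()):
            doc_freq[term] += 1
            pos_freq[term] += label

    scores = {}
    for term, df in doc_freq.items():
        a = pos_freq[term]   # term present, phishing
        b = df - a           # term present, legitimate
        c = n_pos - a        # term absent, phishing
        d = n - n_pos - b    # term absent, legitimate
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def select_top_k(samples, k):
    """Keep the k terms with the highest chi-square score (ties broken by name)."""
    scores = chi_square_scores(samples)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


scores = chi_square_scores(EMAILS)
top_terms = select_top_k(EMAILS, 3)
print(top_terms)
```

In this toy corpus the terms that appear in every phishing message and in no legitimate one receive the highest scores, which is the intuition behind using the statistic to discard uninformative dimensions before training a classifier.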

Highlights

The Internet plays a crucial role in industries and societies worldwide by providing a wide variety of services.
Since mid-March 2020, phishing attacks have been launched using the coronavirus disease of 2019 (COVID-19) as their theme. These phishing attacks contain textual compositions covering several matters, such as Internet and security technologies and information related to the COVID-19 pandemic.

Summary

Introduction

The Internet plays a crucial role in industries and societies worldwide by providing a wide variety of services. According to [3], in the first quarter of 2020, 75% of all phishing sites used secure sockets layer (SSL), and since mid-March of this year phishing attacks have been launched using the coronavirus disease of 2019 (COVID-19) as their theme. To convince their targets, these phishing attacks contain textual compositions covering several matters, such as Internet and security technologies and information related to the COVID-19 pandemic.

Journal: IEEE Access
Publication Date: Jan 1, 2020
Citations: 80
License type: CC BY 4.0
