Abstract

A phishing attack is a threat based on fraudulent communication, usually by e-mail, in which cybercriminals impersonate a trusted person or organization to lure and coax a target. Phishing detection approaches that obtain highly representational features from the text of these e-mails are a suitable strategy to counter these threats, since these features can be used to train machine learning algorithms, generating models able to classify mail samples as phishing or legitimate messages. This paper proposes a multi-stage approach to detect phishing e-mail attacks using natural language processing and machine learning. The proposed multi-stage approach consists of feature engineering within natural language processing, lemmatization, feature selection, feature extraction, improved learning techniques for resampling and cross-validation, and the configuration of hyperparameters. We present two methods within the proposed approach: the first exploits the Chi-Square statistic and Mutual Information to improve dimensionality reduction, while the second combines Principal Component Analysis (PCA) and Latent Semantic Analysis (LSA). Both methods handle the problems of the “curse of dimensionality”, the sparsity, and the amount of information that must be obtained from the context in the Vector Space Model (VSM) representation. These methods yield reduced feature sets that, combined with the XGBoost and Random Forest machine learning algorithms, reach an F1-measure of 100% in validation tests with the SpamAssassin Public Corpus and the Nazario Phishing Corpus datasets. Even when considering only the text in e-mail bodies, the proposed multi-stage phishing detection approach outperforms state-of-the-art schemes for an accredited data set, while requiring a much smaller number of features and presenting lower computational cost.
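To make the feature selection stage of the first method more concrete, the following is a minimal sketch of Chi-Square term scoring on a toy corpus. It is an illustration under stated assumptions, not the authors' implementation: the four messages, the whitespace tokenizer (standing in for lemmatization), and the function names `tokenize`, `chi_square_scores`, and `select_top_k` are all hypothetical, and the score is the standard 2x2 contingency Chi-Square statistic over term presence per message.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace split stands in for real lemmatization.
    return text.lower().split()


def chi_square_scores(docs, labels):
    """Chi-Square statistic of each term's presence against a binary label.

    labels: 1 for phishing, 0 for legitimate (toy convention).
    """
    n = len(docs)
    n_pos = sum(labels)
    n_neg = n - n_pos
    pos_df, neg_df = Counter(), Counter()
    for text, label in zip(docs, labels):
        # Count each term once per message (document frequency).
        for term in set(tokenize(text)):
            (pos_df if label else neg_df)[term] += 1

    scores = {}
    for term in set(pos_df) | set(neg_df):
        a = pos_df[term]   # phishing messages containing the term
        b = neg_df[term]   # legitimate messages containing the term
        c = n_pos - a      # phishing messages without the term
        d = n_neg - b      # legitimate messages without the term
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def select_top_k(scores, k):
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Hypothetical toy corpus, not taken from SpamAssassin or Nazario.
docs = [
    "verify your account password now",
    "urgent verify your bank account",
    "meeting notes for the project",
    "lunch plans for the team meeting",
]
labels = [1, 1, 0, 0]

scores = chi_square_scores(docs, labels)
top_terms = select_top_k(scores, 3)
```

Terms that appear in every message of one class and in none of the other (such as "verify" here) receive the highest score, while terms seen in a single message score lower. Keeping only the top-scoring terms is one simple way to shrink a sparse Vector Space Model representation before training a classifier.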

Highlights

  • The Internet plays a crucial role in industries and societies worldwide by providing a wide variety of services

  • Gualberto et al.: The Answer is in the Text: Multi-Stage Methods for Phishing Detection based on Feature Engineering

  • Since mid-March of this year, phishing attacks have been launched using the coronavirus disease of 2019 (COVID-19) as their theme. These phishing attacks contain textual compositions covering several matters, such as the Internet and security technologies, and information related to the COVID-19 pandemic


Summary

Introduction

The Internet plays a crucial role in industries and societies worldwide by providing a wide variety of services. According to [3], in the first quarter of 2020, 75% of all phishing sites used secure sockets layer (SSL), and since mid-March 2020 phishing attacks have been launched using the coronavirus disease of 2019 (COVID-19) as their theme. To convince their targets, these phishing attacks contain textual compositions covering several matters, such as the Internet and security technologies, and information related to the COVID-19 pandemic.
