Abstract

The increasing popularity of the Internet has led to substantial growth in e-commerce. However, such activities face major security challenges, primarily caused by cyberfraud and identity theft. Therefore, checking the legitimacy of visited web pages is a crucial task for securing customers' identities and preventing phishing attacks. The use of machine learning is widely recognized as a promising solution, and the literature is rich with studies that apply machine learning techniques to website phishing detection. However, their findings are dataset dependent and far from generalizable. Two main reasons for this unfortunate state are the impracticality of replication and the absence of appropriate benchmark datasets for a fair evaluation of systems. Moreover, phishing tactics are continuously evolving, and proposed systems do not keep pace with these rapid changes. In this paper, we present a general scheme for building reproducible and extensible datasets for website phishing detection. The aim is to (1) enable comparison of systems adopting different features, (2) overcome the short-lived nature of phishing websites, and (3) keep track of the evolution of phishing tactics. To experiment with the proposed scheme, we adopt a refined categorization of website phishing features, systematically select a total of 87 commonly recognized ones, categorize them, and subject them to relevance and runtime analysis. We use the collected set of features to build a dataset in light of the proposed scheme. Thereafter, we use a conceptual replication approach to check the genericity of former findings on the built dataset. Specifically, we evaluate the performance of classifiers on individual and combined categories of features, investigate different combinations of models, and explore the effects of filter and wrapper methods on the selection of discriminative features. The results show that Random Forest is the most predictive classifier. Features gathered from external services are the most discriminative, whereas features extracted from web page contents are less distinguishing. Besides external-service-based features, some web page content features are found unsuitable for runtime detection. The use of hybrid features provided the best accuracy score of 96.61%. Among the investigated feature selection methods, filter-based ranking with incremental removal of the least important features improved the performance to 96.83%, outperforming wrapper methods.
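To make the feature selection procedure concrete, the following is a minimal sketch of filter-based ranking with incremental removal of the least important features, evaluated with a Random Forest classifier. It assumes scikit-learn is available and uses synthetic placeholder data in place of the actual 87-feature phishing dataset; it illustrates the general technique, not the authors' exact pipeline.

```python
# Sketch: filter-based ranking + incremental removal of least important
# features, scored with a Random Forest. Placeholder data stands in for
# the 87 phishing features described in the paper.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: 87 features, binary labels (phishing vs. legitimate).
X, y = make_classification(n_samples=2000, n_features=87, random_state=0)

# Filter step: rank features once by mutual information with the label.
scores = mutual_info_classif(X, y, random_state=0)
ranking = np.argsort(scores)[::-1]  # most informative first

best_acc, best_k = 0.0, X.shape[1]
for k in range(X.shape[1], 0, -5):  # incrementally drop the lowest-ranked features
    kept = ranking[:k]
    acc = cross_val_score(RandomForestClassifier(random_state=0),
                          X[:, kept], y, cv=5).mean()
    if acc > best_acc:
        best_acc, best_k = acc, k

print(f"best accuracy {best_acc:.4f} with top {best_k} features")
```

A wrapper method differs in that the classifier is re-fitted inside the search itself (e.g., scikit-learn's RFE or SequentialFeatureSelector), which is typically more expensive than ranking once with a filter and pruning incrementally.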
