Feature Engineering Framework to detect Phishing Websites using URL Analysis

N Swapna Goud, Anjali Mathur

doi:10.14569/ijacsa.2021.0120733

Abstract

Phishing is one of the most popular and dangerous cyber-attacks on the internet. A common aim of such attacks is to obtain the personal information of internet users through a “Phishing Website”. The main vehicle for this is the URL: the attacker creates a near replica of the original URL that differs only slightly, so the difference generally goes unnoticed without close inspection. By pipelining various machine learning algorithms, the proposed model aims to identify the features that are important for classifying a URL, using a recursive feature elimination process. The data set of URL records collected for this work has 112 features, including one target value. A machine learning based model is proposed to identify the significant features used to classify a URL; the wrapper method, recursive feature elimination, is used to compare different bagging and boosting machine learning approaches. Ensemble algorithms (bootstrap aggregation, boosting, and stacking) are used for feature selection. The proposed work has five parts: pre-processing, finding the relations between the features of the data set, automatic selection of the number of features using Extra Tree Classifier, comparison of the various ensemble algorithms, and finally generation of the best features for URL analysis. This paper designs a meta-learner with an XGBoost classifier as the base classifier and achieves an accuracy of 93%. Out of 112 features, the extensive comparative study of feature selection identifies 29 features as core features for URL analysis.

Highlights

The phishing attack can be handled based on source code, URL, or image
The digital world suffers heavily from cyber security attacks
Feature selection is based on the accuracy score, and the score for every feature is tabulated in Table IV. When the accuracy is calculated for all 111 features, it is observed that the accuracy is higher, at 94%, when the number of features is 29; these features are tabulated in a new table

Summary

Introduction

A phishing attack can be handled based on source code, URL, or image. This research designs its model on URL features, which are further classified into four subcategories. With the growth of e-commerce applications, cybercrime is increasing rapidly [1]. To address this issue, researchers are focusing on the detection of phishing websites using Machine Learning and Deep Learning techniques. This research tries to find the most important attributes for determining whether a website is a phishing website or not. To guard against such sites, the model compares every component of the URL in order to mark it as a “Phish Website” [7].
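As a rough illustration of what comparing every component of a URL can look like in practice, the sketch below splits a URL into parts and counts special characters in each. The four components used here (whole URL, domain, path, query), the character set, the function name url_component_features, and the example URL are all assumptions made for illustration; the summary does not name the paper's four subcategories or its exact feature definitions.

```python
from urllib.parse import urlsplit

# Characters counted per component. This set is an illustrative assumption,
# not the feature list used in the paper.
SPECIAL_CHARS = ".-_/?=@&"


def url_component_features(url):
    """Split a URL into components and derive simple count features.

    The component names below are hypothetical stand-ins for the paper's
    four URL feature subcategories.
    """
    parts = urlsplit(url)
    components = {
        "url": url,
        "domain": parts.hostname or "",
        "path": parts.path,
        "query": parts.query,
    }
    features = {}
    for name, text in components.items():
        # Length of the component, then one count per special character.
        features[f"{name}_length"] = len(text)
        for ch in SPECIAL_CHARS:
            features[f"{name}_count_{ch}"] = text.count(ch)
    return features


# Hypothetical example URL.
features = url_component_features(
    "http://login.example.com/secure/update?id=1&user=a"
)
print(features["domain_count_."], features["path_count_/"])
```

Each URL then becomes one row of numeric features, which is the kind of table a feature selection step such as recursive feature elimination could operate on.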

Methods

Results

Conclusion