MFBFST: Building a stable ensemble learning model using multivariate filter-based feature selection technique for detection of suspicious URL

Sanjukta Mohanty, Arup Abhinna Acharya

doi:10.1016/j.procs.2023.01.145

Open Access

Abstract

The overwhelming growth and popularity of the web has motivated cyber attackers to develop fraudulent web sites and to execute numerous attack strategies: tricking users into revealing sensitive information, installing malware automatically through drive-by-download attacks, stealing identities and money, and so on. Most of these attacks spread through compromised URLs (Uniform Resource Locators), which greatly affects internet performance. Blacklisting is commonly adopted to mitigate these issues, but it cannot identify zero-day attack patterns and is computationally expensive, because each URL must be searched for in a large database. Designing an effective detection framework for suspicious web sites is therefore a challenging task. To overcome these issues, we propose a suspicious URL detection technique that selects the most influential features for classifying a URL as safe or malignant, using a multivariate filter-based feature selection technique (MFBFST) from machine learning. Redundant features are eliminated by the correlation feature selection (CFS) technique, and the significance of the remaining relevant attributes is tested with a statistical t-test to obtain the features that have the most impact on the prediction of malicious web sites. The relevant features obtained from MFBFST are then used to evaluate machine learning algorithms such as Bagging, AdaBoost, GBoost and kNN (k Nearest Neighbour) in order to build an efficient model for predicting malicious web sites. To assess how much MFBFST improves the classification results, we evaluated the classifiers with and without the feature selection technique (FST), and we tested scalability on two publicly available datasets. Our implementation results show that, when CFS is used, the machine learning algorithms achieve a highest classification accuracy of 97% on dataset I and 99.25% on dataset II. We also compared the proposed approach with existing studies in terms of classification accuracy and found that it achieves a significant improvement in detecting malignant URLs.
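The two filtering stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a greedy pairwise Pearson filter as a stand-in for CFS and a cut-off on the absolute Welch t statistic as a stand-in for a p-value test. The feature names, the toy values and both thresholds are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no correlation information
    return cov / (sx * sy)


def drop_redundant(columns, threshold=0.9):
    """Keep a feature only if it is not strongly correlated with one already kept.

    Assumption: a simple greedy pairwise filter, used here in place of CFS.
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


def welch_t(a, b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((v - ma) ** 2 for v in a) / (len(a) - 1)
    vb = sum((v - mb) ** 2 for v in b) / (len(b) - 1)
    se = math.sqrt(va / len(a) + vb / len(b))
    if se == 0:
        return 0.0 if ma == mb else math.copysign(math.inf, ma - mb)
    return (ma - mb) / se


def select_features(columns, labels, corr_threshold=0.9, t_threshold=2.0):
    """Stage 1: remove redundant features. Stage 2: keep those with a large |t|."""
    selected = []
    for name in drop_redundant(columns, corr_threshold):
        safe = [v for v, y in zip(columns[name], labels) if y == 0]
        malicious = [v for v, y in zip(columns[name], labels) if y == 1]
        if abs(welch_t(safe, malicious)) >= t_threshold:
            selected.append(name)
    return selected


# Hypothetical toy data: 0 = safe URL, 1 = malicious URL.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "url_length": [20, 22, 19, 21, 60, 65, 58, 62],          # informative
    "url_length_x2": [40, 44, 38, 42, 120, 130, 116, 124],   # redundant copy
    "num_dots": [2, 3, 2, 3, 2, 3, 2, 3],                    # same in both classes
}

print(select_features(features, labels))
```

In this toy example the redundant copy is removed by the correlation stage and `num_dots` is removed by the t-test stage, leaving only `url_length`. The surviving columns would then be passed to the classifiers named in the abstract.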

Full Text