Malicious URL Detection: An Evaluation of Feature Extraction and Machine Learning Algorithm

Yichen Wang

doi:10.54097/hset.v23i.3209

Abstract

Cyber attacks are increasing rapidly and pose a serious threat to network security. Many of these attacks are delivered through malicious Uniform Resource Locators (URLs), and various approaches have therefore been developed to detect such URLs. Machine learning and deep learning are among the most competitive of these techniques. However, the specific techniques for URL feature extraction and the choice of machine learning algorithm are still under development. This paper aims to provide a reference for selecting feature extraction methods and machine learning algorithms. In the designed experiment, the selected URLs are processed by two different feature extraction methods: (1) tokenization and vectorization, and (2) lexical feature selection. This yields two different datasets (data1 and data2) for machine learning. Two traditional learning algorithms (Logistic Regression and SVM) and three ensemble learning algorithms (Random Forest, Gradient Boosting, and Bagging) are used as detection models on both datasets. The experimental results demonstrate that tokenization and vectorization for feature extraction, combined with ensemble learning algorithms, can give good predictive performance for malicious URL detection.
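To make the two feature extraction routes concrete, the following is a minimal sketch in plain Python, not the paper's implementation. It assumes that tokenization splits a URL on non-alphanumeric characters and that vectorization is a simple token count; the abstract does not say which vectorizer or which lexical features were used, so the function names, the example URLs, and the five lexical features below are all hypothetical.

```python
import re
from collections import Counter


def tokenize(url):
    """Split a URL into lowercase tokens on non-alphanumeric characters (assumed scheme)."""
    return [t for t in re.split(r"[^A-Za-z0-9]+", url.lower()) if t]


def build_vocabulary(urls):
    """Assign a column index to every token seen in the corpus, in first-seen order."""
    vocab = {}
    for url in urls:
        for tok in tokenize(url):
            vocab.setdefault(tok, len(vocab))
    return vocab


def vectorize(url, vocab):
    """Count-vectorize one URL against the vocabulary; unknown tokens are ignored."""
    vec = [0] * len(vocab)
    for tok, n in Counter(tokenize(url)).items():
        if tok in vocab:
            vec[vocab[tok]] = n
    return vec


def lexical_features(url):
    """Hand-picked lexical features (hypothetical; the paper's feature list may differ)."""
    return {
        "length": len(url),
        "num_dots": url.count("."),
        "num_digits": sum(c.isdigit() for c in url),
        "num_hyphens": url.count("-"),
        "has_at": int("@" in url),
    }


# Illustrative URLs only; these are not taken from the paper's data.
urls = [
    "http://example.com/login",
    "http://secure-login.example.com/update.php?id=42",
]
vocab = build_vocabulary(urls)
token_vectors = [vectorize(u, vocab) for u in urls]    # route 1: tokenization + vectorization
lexical_rows = [lexical_features(u) for u in urls]     # route 2: lexical feature selection
```

Either representation could then be passed to a classifier such as Logistic Regression or Random Forest. The token route grows one column per distinct token, while the lexical route keeps a small, fixed set of columns.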

Full Text