Towards Fighting Cybercrime: Malicious URL Attack Type Detection using Multiclass Classification

Tariro Manyumwa, Phillip Francis Chapita, Shouling Ji, Hanlu Wu

doi:10.1109/bigdata50022.2020.9378029

Abstract

Malicious Uniform Resource Locators (URLs) remain one of the most common threats to cybersecurity. They are commonly spread through phishing, malware and spam. One popular way to detect malicious URLs is through blacklists, which maintain records of previously known malicious URLs. These lists fall short, however, when newly generated malicious URLs need to be detected. For that reason, recent research has turned to training machine learning algorithms to detect malicious URLs. In this paper, we contribute to the detection of malicious URLs using URL-based features in a multiclass classification setting. We focus on three popular URL attack types: phishing, spam and malware. Our work can be used as a supplementary tool in new or existing anti-phishing, anti-spam and anti-malware detection platforms. We compare the performance of four ensemble learners: Extreme Gradient Boosting (XGBoost), Adaptive Boosting (AdaBoost), Light Gradient Boosting Machine (LightGBM) and Categorical Boosting (CatBoost). We also evaluate a set of URL features that we refer to as our features. These include priority features such as Kullback-Leibler divergence (KL divergence), bag-of-words segmentation and other word-based features. Experiments that included our features outperformed those that excluded them. We trained the algorithms on 126,983 URLs from benchmark datasets, and all four learners achieved an overall accuracy above 0.95.
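
As a rough illustration of how a KL divergence feature for a URL might be computed, the sketch below compares the smoothed character distribution of a URL against a reference distribution. The abstract does not specify the reference distribution, the alphabet or the smoothing used, so the `ALPHABET` constant, the add-one smoothing, the tiny reference text and the function names are all assumptions made for illustration, not the authors' method.

```python
import math
from collections import Counter

# Assumption: only lowercase ASCII letters are compared.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def char_distribution(text, alphabet=ALPHABET):
    """Return an add-one smoothed character distribution over `alphabet`."""
    counts = Counter(ch for ch in text.lower() if ch in alphabet)
    total = sum(counts.values())
    # Smoothing keeps every probability non-zero so the logarithm is defined.
    return {ch: (counts[ch] + 1) / (total + len(alphabet)) for ch in alphabet}


def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the same keys."""
    return sum(p[ch] * math.log(p[ch] / q[ch]) for ch in p)


def url_kl_feature(url, reference):
    """Hypothetical feature: divergence of the URL's characters from `reference`."""
    return kl_divergence(char_distribution(url), reference)


# Assumption: a stand-in reference text; a real system would use a large corpus.
reference = char_distribution("the quick brown fox jumps over the lazy dog")

for url in ("example.com/login", "xkqzjv-xkqzjv.biz/qqzx"):
    print(url, round(url_kl_feature(url, reference), 4))
```

A URL whose characters are distributed like the reference text yields a value near zero, while one dominated by rare characters yields a larger value. The resulting number would be one column in the feature matrix passed to a multiclass learner.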

Full Text