Abstract

Malicious websites in general, and phishing websites in particular, attempt to mimic legitimate websites in order to trick users into trusting them. These websites, often a primary vehicle for credential collection, pose a severe threat to large enterprises, since harvested credentials enable malicious actors to infiltrate enterprise systems without triggering the usual alarms. There is therefore a vital need for deep insight into the statistical features of these websites that enable Machine Learning (ML) models to distinguish them from their benign counterparts. Our objective in this paper is to provide this investigation. More specifically, our first contribution is to observe and evaluate combinations of feature sources that have not been studied in the existing literature, primarily involving embeddings extracted with Transformer-type neural networks. The second contribution is a new dataset for this problem, GAWAIN, constructed in a way that offers other researchers access not only to the data but also to our entire data acquisition and processing pipeline. Experiments on the new GAWAIN dataset show that the classification problem is much harder than reported in other studies: we obtain around 84% test accuracy. Among individual feature contributions, the most relevant ones come from URL embeddings, indicating that this additional step in the processing pipeline is needed to improve predictions. A surprising outcome of the investigation is the absence of content-related features (HTML, JavaScript) from the top-10 list. When comparing prediction outcomes between models trained on features commonly used in the literature versus embedding-related features, the gain with embeddings is slightly above 1% in test accuracy. However, we argue that even this somewhat small increase can play a significant role in detecting malicious websites, and thus these feature categories are worth investigating further.

Highlights

  • Malicious websites are often designed to host unsolicited content, such as adware, back doors, exploits, and phishing, in order to deceive users on multiple levels; according to the IC3 2020 report [1], they cause losses of billions of dollars yearly

  • To combat these drawbacks, more robust methods that employ Machine Learning (ML) techniques have been studied. These include training a multitude of ML classifiers using a standard feature-based approach [3–9], the use of Transformers and Convolutional Neural Networks (CNNs) on raw URLs [10], an anti-phishing system that employs natural language processing (NLP)-based features [11], and ML-based solutions against typo-squatting attacks [12]

  • We evaluate several classic ML models on both datasets, along with two further variants in which JavaScript features are removed from the initial data


Introduction

Malicious websites are often designed to host unsolicited content, such as adware, back doors, exploits, and phishing, in order to deceive users on multiple levels; according to the IC3 2020 report [1], they cause losses of billions of dollars yearly. The most common form of detection in the field is configuring Deny/Allow lists, which fail to adapt to new URLs and require excessive manual maintenance. Another method involves an Intrusion Detection/Prevention System (IDS/IPS), which falls short on several fronts [2]. To combat these drawbacks, more robust methods that employ Machine Learning (ML) techniques have been studied. These include training a multitude of ML classifiers using a standard feature-based approach [3–9], the use of Transformers and CNNs on raw URLs [10], an anti-phishing system that employs NLP-based features [11], and ML-based solutions against typo-squatting attacks [12].
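To make the standard feature-based approach concrete, the sketch below computes a few lexical URL features of the kind commonly used in this literature (URL length, digit and dot counts, character entropy, IP-address hosts). This is a minimal illustration, not the feature set of any particular cited work; the function and feature names are our own.

```python
import math
from urllib.parse import urlparse

def url_lexical_features(url: str) -> dict:
    """Illustrative lexical features for feature-based malicious-URL
    classifiers (feature names are hypothetical)."""
    host = urlparse(url).netloc
    # Shannon entropy over the URL's character distribution: random-looking
    # (e.g. algorithmically generated) URLs tend to score higher.
    counts = {}
    for ch in url:
        counts[ch] = counts.get(ch, 0) + 1
    entropy = -sum((c / len(url)) * math.log2(c / len(url))
                   for c in counts.values())
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_digits": sum(ch.isdigit() for ch in url),
        "num_dots": url.count("."),
        "num_hyphens": url.count("-"),
        "has_ip_host": host.replace(".", "").isdigit(),
        "char_entropy": round(entropy, 3),
    }

feats = url_lexical_features("http://login-secure.example.com/verify?id=123")
```

Such hand-crafted vectors can then be fed to any classic ML classifier; the embedding-based pipelines discussed above replace or augment them with learned representations of the raw URL string.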
