Malicious and Benign URL Dataset Generation Using Character-Level LSTM Models

Spencer Vecile, Katarina Grolinger, Kyle Lacroix, Jagath Samarabandu

doi:10.1109/dsc54232.2022.9888835

Abstract

As technologies advance, so do the attacks on them. Cybersecurity plays a significant role in protecting society. Malicious URLs are links designed to promote scams, attacks, and fraud. Companies often have web filtering algorithms that blacklist specific URLs as malicious; however, due to privacy concerns, they do not give outside entities access to their cybersecurity data. This leaves cybersecurity research and machine learning applications in dire need of more data. This paper proposes using machine learning to generate new synthetic URLs that are characteristically indistinguishable from the data they replace. To do this, two character-level long short-term memory (LSTM) models were trained, one to generate malicious URLs and one to generate benign URLs. Two tests were performed to assess the quality of the synthetic data: (1) classifying the URLs as malicious or benign to ensure that the characteristics of the original data were preserved, and (2) using the Levenshtein ratio to check the similarity between the real and synthetic URLs to ensure sufficient anonymization. The results of the classification test show that the synthetic data classifier only slightly underperformed the real data classifier; however, with accuracy, precision, recall, sensitivity, and specificity all above 99%, it can be concluded that the characteristics of the malicious and benign URLs were preserved. The Levenshtein ratio tests showed mean similarities of 67% and 79% for the benign and malicious URLs, respectively. In the end, the character-level LSTM models successfully generated an anonymized synthetic dataset that was characteristically similar to the original, which could pave the way for the publication of many more datasets in this way.
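The similarity check in test (2) can be illustrated with a short sketch. The abstract does not state which normalization of the Levenshtein distance was used, so the version below is an assumption: it takes the ratio as 1 minus the edit distance divided by the length of the longer string. Library implementations of a "Levenshtein ratio" may use a different formula and give different values. The function names and the idea of comparing each synthetic URL against its closest real URL are likewise hypothetical, chosen only for illustration.

```python
from typing import Iterable


def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similarity_ratio(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical strings, 0.0 when every
    position of the longer string must be edited."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - levenshtein_distance(a, b) / longest


def closest_real_similarity(synthetic: str, real_urls: Iterable[str]) -> float:
    """Hypothetical helper: similarity of one synthetic URL to its
    nearest neighbour among the real URLs."""
    return max(similarity_ratio(synthetic, real) for real in real_urls)
```

Under this normalization, a ratio near 1.0 would suggest a synthetic URL is almost a copy of a real one, while lower values indicate more distance from the original data. How the reported means of 67% and 79% were aggregated is not specified in the abstract, so the nearest-neighbour helper above should be read as one possible choice.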


