Abstract
As technologies advance, so do the attacks on them. Cybersecurity plays a significant role in society to protect everyone. Malicious URLs are links designed to promote scams, attacks, and frauds. Companies often have web filtering algorithms that will blacklist specific URLs as malicious; however, due to privacy concerns, they will not give outside entities access to their cybersecurity data. Unfortunately, this lack of data creates a dire need for more data in cybersecurity research and machine learning applications. This paper proposes using machine learning to generate new synthetic URLs characteristically indistinguishable from the data they replace. To do this two character-level long short-term memory (LSTM) models were trained, one to generate malicious URLs and one to generate benign URLs. To assess the quality of the synthetic data two tests were performed. (1) Classify the URLs into malicious and benign to ensure the characteristics of the original data were preserved. (2) Use the Levenstein ratio to check the similarity between the real and synthetic URLs to ensure sufficient anonymization. The results from the classification test show that the synthetic data classifier only slightly underperformed the real data classifier; however, with having accuracy, precision, recall, sensitivity, and specificity above 99%, it can be concluded that the characteristics of the malicious and benign URLs were preserved. The Levenstein ratio tests showed a mean of 67% and 79% similarity for the benign and malicious URLs, respectively. In the end, the character-level LSTM model successfully generated an anonymized, synthetic dataset, that was characteristically similar to the original, which could pave the way for the publication of many more datasets in this way.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.