Abstract

While machine learning progress is advancing the detection of malicious URLs, advanced Transformers applied to URLs face difficulties in extracting local information, character-level information, and structural relationships. To address these challenges, we propose a novel approach for malicious URL detection, named TransURL, that is implemented by co-training the character-aware Transformer with three feature modules—Multi-Layer Encoding, Multi-Scale Feature Learning, and Spatial Pyramid Attention. This special Transformer allows TransURL to extract embeddings that contain character-level information from URL token sequences, with three feature modules contributing to the fusion of multi-layer Transformer encodings and the capture of multi-scale local details and structural relationships. The proposed method is evaluated across several challenging scenarios, including class imbalance learning, multi-classification, cross-dataset testing, and adversarial sample attacks. The experimental results demonstrate a significant improvement compared to the best previous methods. For instance, it achieved a peak F1-score improvement of 40% in class-imbalanced scenarios, and exceeded the best baseline result by 14.13% in accuracy in adversarial attack scenarios. Additionally, we conduct a case study where our method accurately identifies all 30 active malicious web pages, whereas two pior SOTA methods miss 4 and 7 malicious web pages respectively. The codes and data are available at: https://github.com/Vul-det/TransURL/.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.