Abstract

In this paper, we present a stacking model that detects phishing webpages using URL and HTML features. We design lightweight URL and HTML features and introduce HTML string embedding, none of which rely on third-party services, making real-time detection applications feasible. We then build a stacking model that combines GBDT, XGBoost, and LightGBM in multiple layers so that the models complement one another, improving phishing webpage detection performance. For evaluation, we collect two real-world datasets, named 50K-PD and 50K-IPD. 50K-PD contains 49,947 webpages with URLs and HTML code. 50K-IPD contains 53,103 webpages with screenshots in addition to URLs and HTML code. The proposed approach outperforms several machine learning models on multiple metrics. On the 50K-PD dataset, it achieves 97.30% accuracy, a 4.46% missing alarm rate, and a 1.61% false alarm rate. On the 50K-IPD dataset, it achieves 98.60% accuracy, a 1.28% missing alarm rate, and a 1.54% false alarm rate.
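
The three reported metrics can be illustrated with a minimal sketch. The abstract does not define them, so the code assumes the conventional definitions with phishing as the positive class: missing alarm rate as the fraction of phishing pages classified as legitimate, FN / (TP + FN), and false alarm rate as the fraction of legitimate pages flagged as phishing, FP / (FP + TN). The function name `detection_metrics` and the toy labels are hypothetical and are not taken from the paper.

```python
def detection_metrics(y_true, y_pred):
    """Return accuracy, missing alarm rate, and false alarm rate.

    Assumed label convention: 1 = phishing (positive), 0 = legitimate (negative).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # phishing caught
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # phishing missed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # legitimate flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # legitimate passed

    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        # Assumed definition: FN / (TP + FN)
        "missing_alarm_rate": fn / (tp + fn) if (tp + fn) else 0.0,
        # Assumed definition: FP / (FP + TN)
        "false_alarm_rate": fp / (fp + tn) if (fp + tn) else 0.0,
    }


# Toy data only: 4 phishing pages and 6 legitimate pages.
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
metrics = detection_metrics(toy_true, toy_pred)
```

Under these assumed definitions, a low missing alarm rate means few phishing pages slip through, while a low false alarm rate means few legitimate pages are blocked. The two can trade off against each other, which is likely why the abstract reports both alongside accuracy.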
