Abstract
Phishing has become a prevailing method for attackers to steal users’ private data and commit fraud, posing a serious threat to Internet users. How to detect phishing websites has attracted great interest from both academia and industry. A popular approach is to use a support vector machine (SVM) to detect phishing websites. However, this approach relies on extracting features designated by experts, and the prediction effectiveness of the model is greatly affected by the quality of feature extraction. In addition, it cannot handle features that are not identifiable. Deep learning methods have therefore become popular, as they do not require manual feature engineering. However, many deep learning methods can only learn feature information of uniform resource locators (URLs) at the character level, while ignoring the intrinsic connections between words. To address these limitations, we propose a novel highway deep pyramid convolution neural network (HDP-CNN), a deep convolutional network that combines character-level and word-level representation information. HDP-CNN first receives the URL string sequences as input, then performs character-level embedding and word-level embedding separately. Afterward, it uses the Highway network to connect the character-level and word-level embedding representations of the URL and extracts local features of different sizes from the region embedding layer. Finally, it passes them into the designed deep pyramid structure network to capture the global representation of the URL. Our experiments illustrate that the information expressed by embedding vectors of different granularities differs subtly. By combining embedding feature information of different granularities, HDP-CNN performs better than methods based on single embedding feature information. In our experiments, we construct an imbalanced dataset in which the ratio of benign websites to phishing websites is close to 5:1. The experimental results demonstrate that our method outperforms other methods, with an accuracy of 98.30%, a true positive rate (TPR) of 99.18%, and a true negative rate (TNR) of 94.34%.
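To illustrate why TPR and TNR are reported alongside accuracy on an imbalanced dataset, the following minimal Python sketch computes the three metrics from confusion-matrix counts. The class name `Confusion` and all counts are hypothetical, chosen only to mimic a 5:1 class ratio. They are not the data or results of this work, and the abstract does not state which class is treated as positive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Confusion-matrix counts for a binary classifier (hypothetical helper)."""
    tp: int  # positives predicted as positive
    fn: int  # positives predicted as negative
    tn: int  # negatives predicted as negative
    fp: int  # negatives predicted as positive

    @property
    def accuracy(self) -> float:
        # Fraction of all samples that were classified correctly.
        return (self.tp + self.tn) / (self.tp + self.fn + self.tn + self.fp)

    @property
    def tpr(self) -> float:
        # True positive rate: correct positives over all actual positives.
        return self.tp / (self.tp + self.fn)

    @property
    def tnr(self) -> float:
        # True negative rate: correct negatives over all actual negatives.
        return self.tn / (self.tn + self.fp)


# Assumed, made-up test set: 5000 samples of the majority class (taken here
# as "positive") and 1000 of the minority class, i.e. a 5:1 ratio.

# A degenerate classifier that always predicts the majority class.
majority_only = Confusion(tp=5000, fn=0, tn=0, fp=1000)

# A classifier that handles both classes well (counts invented for illustration).
both_classes = Confusion(tp=4950, fn=50, tn=940, fp=60)

for name, cm in [("majority_only", majority_only), ("both_classes", both_classes)]:
    print(f"{name}: accuracy={cm.accuracy:.4f} TPR={cm.tpr:.4f} TNR={cm.tnr:.4f}")
```

In this made-up setting, the classifier that always predicts the majority class still reaches an accuracy of 5/6 (about 83.3%) while its TNR is 0. Accuracy alone therefore says little at a 5:1 ratio, and the per-class rates are needed to show that both classes are recognized.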