Web2Vec: Phishing Webpage Detection Method Based on Multidimensional Features Driven by Deep Learning

Jian Feng, Ou Ye, Jingzhou Han, Lianyang Zou

doi:10.1109/access.2020.3043188

Abstract

Phishing is an online attack that attempts to trick network users into giving up sensitive information. Current phishing webpage detection methods rely mainly on manually collected features, so feature extraction is complicated and possible correlation between features cannot be avoided. To solve these problems, a new phishing webpage detection model is proposed. Its main components are representation learning, which automatically learns representations from multi-aspect features, and a hybrid deep learning network, which extracts features. First, the model treats the URL, the HTML page content, and the DOM (Document Object Model) structure of a webpage as separate character sequences and uses representation learning to automatically learn a representation of the webpage. Then, it sends these representations through different channels to a hybrid deep learning network composed of a convolutional neural network and a bidirectional long short-term memory network to extract local and global features, and uses an attention mechanism to strengthen the influence of important features. Finally, the outputs of the channels are fused to produce the classification prediction. Four sets of experiments verify the detection performance of the model. The results show that its overall classification performance is better than that of existing classic phishing webpage detection methods: the accuracy reaches 99.05%, and the false positive rate is only 0.25%. These results indicate that extracting webpage features from multiple aspects through representation learning and a hybrid deep learning network can effectively improve the detection of phishing webpages.
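As a minimal sketch of the first step described in the abstract, the Python below turns a URL into a fixed-length sequence of character ids and reduces an HTML page to a flat sequence of tag names as a rough stand-in for DOM structure. The vocabulary, the padding and unknown ids, the maximum length, and the helper names `encode_chars` and `dom_tag_sequence` are all assumptions made for illustration; the source does not specify the authors' actual preprocessing.

```python
from html.parser import HTMLParser
import string

# Hypothetical character vocabulary: id 0 is padding, id 1 is "unknown".
ALPHABET = string.ascii_lowercase + string.digits + string.punctuation
PAD_ID, UNK_ID = 0, 1
CHAR_TO_ID = {ch: i + 2 for i, ch in enumerate(ALPHABET)}


def encode_chars(text, max_len):
    """Map text to a fixed-length list of character ids (truncate or right-pad)."""
    ids = [CHAR_TO_ID.get(ch, UNK_ID) for ch in text.lower()[:max_len]]
    return ids + [PAD_ID] * (max_len - len(ids))


class TagSequenceParser(HTMLParser):
    """Collects opening tag names in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def dom_tag_sequence(html):
    """Return the opening tags of an HTML string as a flat list (assumed DOM proxy)."""
    parser = TagSequenceParser()
    parser.feed(html)
    parser.close()
    return parser.tags


# Illustrative inputs only; not taken from the paper's dataset.
url = "http://example.com/login"
html = "<html><body><form><input></form></body></html>"

url_ids = encode_chars(url, max_len=32)
dom_tags = dom_tag_sequence(html)
```

In a full pipeline, sequences like `url_ids` would feed an embedding layer in one channel of the network, with the page content and DOM sequences handled in parallel channels before fusion.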

Highlights

Phishing is an attack in which attackers use social engineering, technical disguise, and other methods to trick users into visiting fake webpages, for example by sending deceptive spam or real-time communication messages, in order to induce them to disclose personal identity, financial account, and other sensitive information.
Traditional phishing webpage detection methods are mainly based on the analysis and modeling of manually extracted multi-source features such as URL features, page content features, and webpage structural features, which once showed strong resistance to phishing attacks.
Experimental results and analysis: to verify the effectiveness of the Web2Vec model, four sets of experiments are designed to answer the following questions. Question 1: compared with the classic phishing webpage detection methods, how effective is the Web2Vec model at detection?
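The abstract reports accuracy and false positive rate as the measures of detection effectiveness. The sketch below shows how these two quantities are conventionally computed from binary labels, assuming 1 marks a phishing page and 0 a legitimate one. The function name `accuracy_and_fpr` and the toy labels are hypothetical and are not the paper's data or results.

```python
def accuracy_and_fpr(y_true, y_pred):
    """Return (accuracy, false positive rate) for binary labels.

    Assumed convention: 1 = phishing (positive), 0 = legitimate (negative).
    """
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    accuracy = (tp + tn) / len(pairs)
    # FPR is the share of legitimate pages wrongly flagged as phishing.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, fpr


# Toy labels for illustration only.
y_true = [1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 0, 1, 0]
acc, fpr = accuracy_and_fpr(y_true, y_pred)
```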

Summary

Introduction

Phishing is an attack in which attackers use social engineering, technical disguise, and other methods to trick users into visiting fake webpages, for example by sending deceptive spam or real-time communication messages, in order to induce them to disclose personal identity, financial account, and other sensitive information. In the offensive and defensive game with phishing, phishing webpage analysis and detection technology has developed continuously. Traditional detection methods, including blacklist-based [2], heuristic-based [3,4], visual similarity-based [5,6], and machine learning-based [7,8,9,10,11] methods, have been proposed, and detection methods based on deep learning [13,14,15,16,17,18,19,20,21] have emerged in recent years. Traditional phishing webpage detection methods are mainly based on the analysis and modeling of manually extracted multi-source features such as URL features, page content features, and webpage structural features, which once showed strong resistance to phishing attacks. As the iterative update speed of phishing webpages is accelerated and the

Methods

Results

Conclusion