A Detection Method for Phishing Web Page Using DOM-Based Doc2Vec Model

Jian Feng,Yuqiang Qiao,Ying Zhang

doi:10.20532/cit.2020.1004899

Jian Feng, Yuqiang Qiao + Show 1 more

Open Access

PDF Available

https://doi.org/10.20532/cit.2020.1004899

Copy DOI

Export

Save

Cite

Abstract
Full-Text PDF
Similar Papers

Abstract

Listen

Detecting phishing web pages is a challenging task. The existing detection method for phishing web page based on DOM (Document Object Model) is mainly aiming at obtaining structural characteristics but ignores the overall representation of web pages and the semantic information that HTML tags may have. This paper regards DOMs as a natural language with Doc2Vec model and learns the structural semantics automatically to detect phishing web pages. Firstly, the DOM structure of the obtained web page is parsed to construct the DOM tree, then the Doc2Vec model is used to vectorize the DOM tree, and to measure the semantic similarity in web pages by the distance between different DOM vectors. Finally, the hierarchical clustering method is used to implement clustering of web pages. Experiments show that the method proposed in the paper achieves higher recall and precision for phishing classification, compared to DOM-based structural clustering method and TF-IDF-based semantic clustering method. The result shows that using Paragraph Vector is effective on DOM in a linguistic approach.

Full Text