Abstract
Detecting phishing web pages is a challenging task. Existing detection methods based on the DOM (Document Object Model) mainly aim to extract structural characteristics, but ignore the overall representation of a web page and the semantic information that HTML tags may carry. This paper treats the DOM as a natural language and uses the Doc2Vec model to learn its structural semantics automatically in order to detect phishing web pages. First, the obtained web page is parsed to construct its DOM tree. Next, the Doc2Vec model vectorizes the DOM tree, and the semantic similarity between web pages is measured by the distance between their DOM vectors. Finally, hierarchical clustering is used to group the web pages. Experiments show that the proposed method achieves higher recall and precision for phishing classification than a DOM-based structural clustering method and a TF-IDF-based semantic clustering method. The results show that applying Paragraph Vector to the DOM in a linguistic approach is effective.
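The first step of the pipeline, turning a page's DOM into a token sequence that a document-embedding model can consume, might be sketched as follows. This is a minimal illustration using only Python's standard library, not the paper's implementation. Using opening-tag names in document order as the tokens is an assumption, and the names `TagSequenceParser` and `dom_to_tokens` are hypothetical.

```python
from html.parser import HTMLParser


class TagSequenceParser(HTMLParser):
    """Collects opening-tag names in document order.

    Assumption: each tag name acts as one "word", so a page becomes
    one "document" of tag tokens. The paper may tokenize differently.
    """

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        self.tokens.append(tag)


def dom_to_tokens(html):
    """Return the list of tag tokens for one HTML page."""
    parser = TagSequenceParser()
    parser.feed(html)
    parser.close()
    return parser.tokens


if __name__ == "__main__":
    page = "<html><body><form><input><input></form></body></html>"
    print(dom_to_tokens(page))
```

Token lists produced this way could then be passed to a Doc2Vec implementation, and the resulting page vectors grouped with a hierarchical clustering routine. Those steps depend on third-party libraries and are not shown here.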