Characterizing Coding Style of Phishing Websites Using Machine Learning Techniques

May Almousa, Ruben Furst, Mohd Anwar

doi:10.1109/transai54797.2022.00025

Abstract

Social engineering attacks pose a major threat to internet users’ sensitive information, such as credit card details and passwords. Phishing websites are among the most common of these attacks. They imitate legitimate websites in the hope that a user will unknowingly enter sensitive information into the malicious site. This paper attempts to identify and characterize the coding style of phishing websites using machine learning models. We used web scraping to extract the HTML content of around 29,000 phishing websites collected from PhishTank, which publicly tracks such websites. To compare HTML coding style and syntax between phishing and legitimate websites, we used a dataset of around 36,000 legitimate websites. We eliminated websites with missing basic content. From the cleaned phishing and legitimate datasets, we processed the source code of 10,800 websites (5,400 per category), extracting 11 features from each website’s content. Our Random Forest model achieved the best accuracy, 94.16%, in detecting phishing websites.
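To make the feature-extraction step concrete, the following is a minimal sketch of how structural features might be derived from a page's HTML. The abstract does not list the 11 features used, so the five counted here (and the names `StyleFeatureParser` and `extract_features`) are hypothetical illustrations, not the authors' feature set. The sketch uses only Python's standard-library `html.parser`.

```python
from html.parser import HTMLParser


class StyleFeatureParser(HTMLParser):
    """Counts a few structural properties of an HTML document.

    The feature names below are illustrative assumptions; they are not
    the 11 features used in the paper.
    """

    def __init__(self):
        super().__init__()
        self.counts = {
            "num_tags": 0,
            "num_forms": 0,
            "num_password_inputs": 0,
            "num_absolute_links": 0,
            "num_inline_styles": 0,
        }

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        attr = dict(attrs)
        self.counts["num_tags"] += 1
        if tag == "form":
            self.counts["num_forms"] += 1
        if tag == "input" and (attr.get("type") or "").lower() == "password":
            self.counts["num_password_inputs"] += 1
        if tag == "a" and (attr.get("href") or "").startswith(("http://", "https://")):
            self.counts["num_absolute_links"] += 1
        if "style" in attr:
            self.counts["num_inline_styles"] += 1


def extract_features(html):
    """Return a dict of feature counts for one HTML source string."""
    parser = StyleFeatureParser()
    parser.feed(html)
    parser.close()
    return parser.counts


if __name__ == "__main__":
    sample = (
        '<html><body style="margin:0">'
        '<form action="login.php">'
        '<input type="text" name="u"><input type="password" name="p">'
        "</form>"
        '<a href="https://example.com">x</a><a href="/about">y</a>'
        "</body></html>"
    )
    print(extract_features(sample))
```

Feature dictionaries of this kind, one per website and labeled phishing or legitimate, could then be assembled into a matrix and passed to a classifier such as a Random Forest.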

Full Text