Abstract

Social engineering attacks pose a major threat to internet users' sensitive information, such as credit card details and passwords. One of the most common of these attacks is the phishing website, which imitates a legitimate site so that users unknowingly enter their sensitive information into it. This paper attempts to identify and characterize the coding style of phishing websites using machine learning models. We used web scraping to extract the HTML content of around 29,000 phishing websites collected from PhishTank, which publicly tracks such websites. To compare HTML coding style and syntax between phishing and legitimate websites, we used a dataset of around 36,000 legitimate websites. We eliminated websites with missing basic content. From the cleaned datasets, we processed the source code of 10,800 websites (5,400 per category), extracting 11 features from each website's content. Of the models we evaluated, Random Forest achieved the best accuracy, 94.16%, in detecting phishing websites.
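As a rough illustration of the feature-extraction step described above, the following is a minimal sketch that turns a page's HTML into a small set of structural counts. The abstract does not list the 11 features used, so the six features here (`forms`, `password_inputs`, `scripts`, `links`, `absolute_links`, `has_title`) and the names `StyleFeatureParser` and `extract_features` are hypothetical stand-ins, not the paper's actual feature set.

```python
from html.parser import HTMLParser


class StyleFeatureParser(HTMLParser):
    """Counts a few structural properties of an HTML page.

    The feature names are illustrative assumptions; the paper's
    11 features are not specified in the abstract.
    """

    def __init__(self):
        super().__init__()
        self.counts = {
            "forms": 0,
            "password_inputs": 0,
            "scripts": 0,
            "links": 0,
            "absolute_links": 0,
            "has_title": 0,
        }

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        attrs = dict(attrs)
        if tag == "form":
            self.counts["forms"] += 1
        elif tag == "input":
            # Valueless attributes arrive as None, hence the `or ""`.
            if (attrs.get("type") or "").lower() == "password":
                self.counts["password_inputs"] += 1
        elif tag == "script":
            self.counts["scripts"] += 1
        elif tag == "title":
            self.counts["has_title"] = 1
        elif tag == "a":
            self.counts["links"] += 1
            href = attrs.get("href") or ""
            if href.startswith(("http://", "https://")):
                self.counts["absolute_links"] += 1


def extract_features(html):
    """Return a dict of hypothetical style features for one HTML document."""
    parser = StyleFeatureParser()
    parser.feed(html)
    parser.close()
    return parser.counts
```

Feature dictionaries like these, one per website and paired with a phishing/legitimate label, could then be converted to numeric vectors and passed to a classifier such as scikit-learn's `RandomForestClassifier`; that training step is omitted here to keep the sketch dependency-free.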
