HTTP header based phishing attack detection using machine learning

Sanjeev Shukla, Manoj Misra, Gaurav Varshney

doi:10.1002/ett.4872

Abstract

In the past, many techniques were used for anti‐phishing, including blacklisting/whitelisting, third‐party services, search engines, visual similarity, heuristics, URL features, and website content. Search engine‐based tools, third‐party assisted tools, and blacklists/whitelists fail to identify new phishing attacks, resulting in high FPR. Heuristic and visual similarity approaches are slow, whereas URL and web content‐based techniques do not account for the dynamic content of current websites and hence cannot stop zero‐day attacks. We conducted a study of the critical features used in past anti‐phishing work and found 16 HTTP header features that were novel. In this paper, we develop a real‐time, highly scalable, feature‐rich anti‐phishing detection technique based on ML that extracts the HTTP headers (predominantly security headers) from web pages to classify them as legitimate or phishing. We observe that phishing sites are short‐lived and are created to achieve a specific objective, such as stealing a user's credentials. Once the goal is met, the sites are taken down immediately. Hence these sites do not take the trouble to use the security features of web technology and focus only on making the site as similar as possible to the original website. Test results based on our novel features show a high accuracy of 97.8% with an average response time of 1.57 s. We also created multiple datasets for different scenarios: a dataset of websites created with phishing tools and a new dataset for testing unseen phishing attacks. The results obtained on these show detection accuracies of 99% and 95%, respectively.
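The idea of turning HTTP security headers into ML features can be illustrated with a minimal sketch. The abstract does not enumerate the 16 header features, so the `SECURITY_HEADERS` list and the `header_features` helper below are assumptions chosen for illustration only, not the authors' feature set or implementation. The sketch maps a response's headers to a 0/1 presence vector that a classifier could consume.

```python
from typing import Dict, List

# Assumed, illustrative list of well-known security headers.
# The paper's actual 16 features are not listed in the abstract.
SECURITY_HEADERS: List[str] = [
    "strict-transport-security",
    "content-security-policy",
    "x-frame-options",
    "x-content-type-options",
    "referrer-policy",
    "permissions-policy",
]


def header_features(headers: Dict[str, str]) -> List[int]:
    """Return a 0/1 presence vector, one entry per name in SECURITY_HEADERS."""
    # HTTP header names are case-insensitive, so normalise before comparing.
    present = {name.strip().lower() for name in headers}
    return [1 if name in present else 0 for name in SECURITY_HEADERS]


# Hypothetical response headers: one site that sets some security headers,
# and one that sets none.
legit = {
    "Strict-Transport-Security": "max-age=63072000",
    "X-Frame-Options": "DENY",
    "Content-Type": "text/html",
}
bare = {"Content-Type": "text/html", "Server": "nginx"}

legit_vector = header_features(legit)
bare_vector = header_features(bare)
```

Vectors like these, collected for labelled legitimate and phishing pages, could then be passed to any standard classifier; which model the authors use is not stated in the abstract.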

Full Text