Cybercriminals create phishing websites that mimic legitimate websites to obtain sensitive information from companies, individuals, or governments. It is therefore imperative to use state-of-the-art artificial intelligence and machine learning technologies to correctly classify phishing and legitimate URLs. We report the results of applying deterministic and probabilistic neural network models to URL classification. The key achievements of this work are: (1) we develop a unique approach based on probabilistic neural networks that improves classification accuracy; (2) we show, for the first time in URL phishing research, that a machine learning model trained on a combination of open-source and private datasets is successful in production. The dataset is constructed from open sources such as Alexa, PhishTank, and OpenPhish and, most importantly, from real-world production data from EasyDMARC. Daily validation of the model on newly reported URLs and their labels, drawn from both open-source platforms and private production data, reaches 97% accuracy on average. The validation labels come from PhishTank, OpenPhish, and EasyDMARC; mislabeled data cannot be excluded, and the labels could not be checked because of the large number of URLs. Feature engineering was done without third-party dependencies. Lastly, the evaluation of both the deterministic and the probabilistic models shows high accuracy on short and long URLs, where short URLs are defined as having fewer than 50 characters.