Abstract

Phishing is a social engineering attack that has long been perpetrated and remains prominent, with a correspondingly high number of victims. Through phishing, attackers can gain easy access to sensitive information about a company or an individual. This research compares the importance of four categories of features in detecting phishing URLs: lexical features, domain-name-based features, HTML features, and URL tokenization. Experiments were designed to compare the effectiveness of the four feature categories, each used separately, on three machine learning models (K-Nearest Neighbour, Decision Tree, Logistic Regression) and five ensemble learning classifiers (Random Forest, Bagging, Stacking, AdaBoost, Gradient Boosting). The highest accuracy was obtained using URL tokenization with the stacking classifier, with accuracy scores of 96% and 99.3% respectively on the two datasets used. Future work will use more datasets with larger sample sizes to provide a basis for generalisation.
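
To make two of the feature categories concrete, the following is a minimal sketch of URL tokenization and lexical feature extraction using only the Python standard library. It is not the authors' pipeline: the function names `tokenize_url` and `lexical_features`, the choice to split on non-alphanumeric characters, and the particular features computed are all assumptions made for illustration, and the example URL is invented.

```python
import re
from urllib.parse import urlparse


def tokenize_url(url: str) -> list[str]:
    """Split a URL into lowercase alphanumeric tokens.

    Assumption: tokens are delimited by any non-alphanumeric character.
    The paper's actual tokenization scheme may differ.
    """
    return [tok for tok in re.split(r"[^A-Za-z0-9]+", url.lower()) if tok]


def lexical_features(url: str) -> dict[str, int]:
    """Compute a few illustrative lexical features of a URL.

    Assumption: this feature set is hypothetical, chosen only as an example.
    """
    host = urlparse(url).hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots_in_host": host.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "has_at_symbol": int("@" in url),
        # Flags hosts written as a dotted-quad IPv4 address.
        "has_ip_host": int(bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host))),
    }


# Invented example URL, for demonstration only.
example = "http://192.168.0.1/secure-login/update.php?id=42"
tokens = tokenize_url(example)
features = lexical_features(example)
print(tokens)
print(features)
```

In a setup like the one described, tokens of this kind would typically be vectorized (for example as token counts) before being passed to a classifier, while the lexical features could be used directly as a numeric feature vector.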
