Abstract

In this chapter, we design a framework that detects suspicious URLs from features of the URL itself, without requiring the content of the web page. We distinguish malignant web pages using discriminative URL features whose high predictive power improves performance significantly: lexical features, HTTP header-based features, host-based features, geographical features and network features. Our approach uses both batch machine learning (ML) algorithms and ensemble machine learning classifiers (EMLCs) to identify suspicious URLs. EMLCs combine multiple weak learners, each trained on different training examples, to improve model performance (TRAGHA 2019). We compared the batch ML algorithms with the ensemble models experimentally and found that the ensemble approach outperforms the batch ML classifiers. This work extends our previous approach (Mohanty et al. 2020), which used only some batch learning classifiers and a few URL features to classify URLs as either malignant or safe; in that work, the random forest (RF) model achieved the highest accuracy, at 95%. The proposed approach is evaluated against a training dataset that contains both safe and malicious URLs. The ensemble techniques obtain a true positive rate (TPR) of 0.98, a false positive rate (FPR) of 0.01, an accuracy of 98.66%, a precision of 0.95, a recall of 100%, an F1 score of 96% and an AUC of 0.982.
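
As a rough illustration of what content-free, lexical URL features can look like, the following Python sketch derives a few values from the URL string alone. The function name `lexical_features` and the particular features chosen are assumptions made for illustration; the chapter's actual feature set is not specified in this abstract and may differ.

```python
import re
from urllib.parse import urlparse


def lexical_features(url: str) -> dict:
    """Compute a few illustrative lexical features from a URL string.

    Hypothetical feature set: these names are assumptions, not the
    features used in the chapter.
    """
    # urlparse only recognises the host when a scheme is present,
    # so prepend one for scheme-less input.
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots": host.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        # Count of characters often associated with obfuscated URLs.
        "num_special": len(re.findall(r"[@\-_=?&%]", url)),
        # True when the host is a dotted-quad IPv4 address, not a domain name.
        "has_ip_host": bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host)),
        "uses_https": parsed.scheme == "https",
        # Number of non-empty path segments.
        "path_depth": len([seg for seg in parsed.path.split("/") if seg]),
    }


if __name__ == "__main__":
    print(lexical_features("http://192.168.0.1/login/verify?id=1"))
```

A feature dictionary of this kind could then be turned into a numeric vector and passed to a batch or ensemble classifier; the host-based, header-based, geographical and network features mentioned above would need external lookups and are not covered by this sketch.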
