Abstract

In many real-world scenarios, such as fraud detection and phishing website classification, training datasets have a skewed class distribution in which majority-class samples (e.g., legitimate websites) far outnumber minority-class samples (e.g., phishing websites). Most machine learning algorithms assume balanced class distributions and are therefore biased towards the majority (uninteresting) class at the expense of the minority (interesting) class(es). To handle class imbalance, researchers have proposed solutions at both (i) the data level and (ii) the algorithm level. In this study, we propose a dual approach that handles class imbalance in phishing website classification at both the data level and the algorithm level. At the data level, we propose a novel hybrid resampling approach, KMeansSMOTENCR, which balances the dataset by first oversampling the minority class using the KMeans Synthetic Minority Oversampling Technique (KMeansSMOTE) (Douzas et al. in Inf Sci 465:1–20, 2018 [1]) and then applying the Neighborhood Cleaning Rule (NCR) (Laurikkala in AIME, LNAI 2001. Springer, Berlin, pp 63–66, 2001 [2]) undersampling technique as a data-cleaning step, to address the possibility of synthetic minority-class samples invading the majority-class region. At the algorithm level, we employ Cost-Sensitive Random Forest (CS-RF), Cost-Sensitive Extreme Gradient Boosting (CS-XGB), Cost-Sensitive Support Vector Machine (CS-SVM), and Cost-Sensitive Logistic Regression (CS-LR) classifiers. We evaluated the CS-RF, CS-XGB, CS-SVM, and CS-LR classifiers on four datasets: (i) Original (imbalanced), (ii) NCR (balanced), (iii) KMeansSMOTE (balanced), and (iv) KMeansSMOTENCR (balanced). In Sect. 4 (Result and Discussion), we demonstrate that the proposed method obtains the highest ROC_AUC, F1, and GMean, outperforming the other three.
To the best of our knowledge, our novel hybrid resampling approach 'KMeansSMOTENCR' has not been published in existing studies.
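
As an illustration of why metrics such as F1 and GMean are reported for imbalanced data, the following minimal sketch computes both from confusion-matrix counts. It is not the evaluation code of this study: the function name `imbalance_metrics` and all counts are hypothetical, and the phishing class is assumed to be the positive class. GMean is taken here as the geometric mean of sensitivity and specificity.

```python
import math


def imbalance_metrics(tp, fp, tn, fn):
    """Return (f1, gmean) for the positive (minority) class.

    Hypothetical helper for illustration only; inputs are
    confusion-matrix counts.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0       # sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # true negative rate
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    gmean = math.sqrt(recall * specificity)
    return f1, gmean


# Hypothetical test set: 1000 websites, of which 50 are phishing.

# A classifier that labels every website as legitimate.
f1_trivial, gmean_trivial = imbalance_metrics(tp=0, fp=0, tn=950, fn=50)
accuracy_trivial = (0 + 950) / 1000

# A classifier that catches 40 of the 50 phishing sites with 30 false alarms.
f1_better, gmean_better = imbalance_metrics(tp=40, fp=30, tn=920, fn=10)
accuracy_better = (40 + 920) / 1000

print(f"trivial: acc={accuracy_trivial:.3f} f1={f1_trivial:.3f} gmean={gmean_trivial:.3f}")
print(f"better:  acc={accuracy_better:.3f} f1={f1_better:.3f} gmean={gmean_better:.3f}")
```

With these made-up counts, the two classifiers differ by only one percentage point in accuracy, while F1 and GMean are zero for the classifier that ignores the minority class. This is the sense in which accuracy alone hides the bias towards the majority class.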
