This paper presents a discrete mathematical model of phishing attack detection methodologies, emphasizing the need for continual advances in detection methods given how pervasive phishing attacks are in the cybersecurity landscape. Combining mathematical modeling with machine learning algorithms, the study uses three datasets: Mendeley, tokenized URLs, and a merged dataset that integrates both. Multiple machine learning algorithms, including Logistic Regression, k-Nearest Neighbors, Support Vector Machines, Random Forest, Gradient Boosting Machines, Neural Networks, CatBoost, and XGBoost, are systematically applied to evaluate their efficacy. On the original Mendeley dataset, XGBoost achieves the top accuracy of 97.24%, with CatBoost and Random Forest also exceeding 97%. After preprocessing, CatBoost leads with an accuracy of 97.28% and shows superior precision, sensitivity, and F-score. Despite slight accuracy reductions after preprocessing, the models consistently achieve over 94% accuracy on the preprocessed Mendeley dataset, highlighting the substantial impact of preprocessing. Tokenized URLs yield comparatively lower performance, with a highest accuracy of 91.95%, which underlines the challenges of this approach. The combined dataset proves optimal for most models, with XGBoost and SVM achieving the highest overall accuracy of 97.68%. SVM excels in sensitivity and specificity, while XGBoost excels in precision. The merged dataset significantly improves accuracy, sensitivity, specificity, and precision, underscoring its role in refining predictive capabilities for identifying phishing websites. The results section gives a detailed overview of machine learning model performance on the different datasets. CatBoost stands out on the preprocessed Mendeley dataset.
The tokenized URL dataset offers insight into the challenges of that approach, and the combined dataset proves effective for various models. Confusion matrices, ROC curves, and precision-recall curves give a more detailed view of model behavior and point to the need for ongoing refinement and for investigating misclassification patterns to make the models more effective against phishing threats.
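The metrics reported above (accuracy, precision, sensitivity, specificity, and F-score) can all be derived from the four cells of a binary confusion matrix. The following is a minimal sketch of those standard definitions, assuming "phishing" is the positive class. The `ConfusionMatrix` class, the `compute_metrics` function, and the example counts are hypothetical illustrations and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    """Cells of a binary confusion matrix (positive class assumed: phishing)."""
    tp: int  # phishing sites correctly flagged
    fp: int  # legitimate sites wrongly flagged
    tn: int  # legitimate sites correctly passed
    fn: int  # phishing sites missed


def _ratio(numerator: float, denominator: float) -> float:
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def compute_metrics(cm: ConfusionMatrix) -> dict[str, float]:
    """Return the standard metrics derived from a confusion matrix."""
    total = cm.tp + cm.fp + cm.tn + cm.fn
    precision = _ratio(cm.tp, cm.tp + cm.fp)
    sensitivity = _ratio(cm.tp, cm.tp + cm.fn)  # also called recall
    specificity = _ratio(cm.tn, cm.tn + cm.fp)
    return {
        "accuracy": _ratio(cm.tp + cm.tn, total),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        # F-score here is F1, the harmonic mean of precision and sensitivity.
        "f1": _ratio(2 * precision * sensitivity, precision + sensitivity),
    }


# Illustrative counts only; these are NOT results from the paper.
example = ConfusionMatrix(tp=480, fp=12, tn=488, fn=20)
results = compute_metrics(example)
for name, value in results.items():
    print(f"{name}: {value:.4f}")
```

Because sensitivity and specificity are computed from different rows of the matrix, a model can lead on one and trail on the other, which is consistent with different models excelling on different metrics.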